I have two very large files in Unix, each containing, say, 5 columns but millions of lines.
Example:
File 1:
abc|def|ghk|ijk|lmn
....
...
File2 :
abc|def|ghk|ijk|123
...
...
My task is to compare the two large files and find the differing columns and rows. For instance, the output would be:
Column-no Row-no File1-word File2-word
Example:
5 1 lmn 123
The files are really large, and the output can't take too long to produce. I have heard awk is the fastest way to do file parsing in Unix. Can this be done using awk?
paste/awk solution
$ paste -d'|' file1 file2 |
  awk -F'|' '{w=NF/2
              for (i=1; i<=w; i++)
                  if ($i != $(i+w)) print NR, i, $i, $(i+w)}'
1 5 lmn 123
I changed the order; it makes more sense to me to print the line number first and then the field number, but you can change it easily...
Once paste has matched up the lines from the two files, go over the fields of the first half (first file) and compare them with the second half (second file), printing the differences. awk implicitly loops over all records (lines). I haven't tested this with large files, but for the awk part it doesn't matter (it works record by record). I'm not sure how eager paste is, but I doubt it will blink.
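If you'd rather avoid paste, here is a minimal awk-only sketch of the same idea, assuming both files have the same number of lines and fields. Note that it buffers all of file1 in memory, so for really large files the streaming paste pipeline above is the better fit:
$ awk -F'|' 'NR==FNR {line[FNR]=$0; next}
             {n=split(line[FNR], a, "|")
              for (i=1; i<=n; i++)
                  if (a[i] != $i) print FNR, i, a[i], $i}' file1 file2
1 5 lmn 123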
I just started learning awk and sed after my first 2 questions on Stack Overflow. Thanks to Roman, Hek, Randomnir, Edmorton, and the many others who corrected and helped me wholeheartedly.
Right now I can change awk commands to suit my data requirements; I just need some help from everyone here. I'm taking baby steps toward fixing all my Unix errors on my own, so some advice would be helpful.
My data:
ID | Passcode
41-1|10551
1-105|5569
4-7|10043
78-3|217631
3-1|19826
12-1|19818912
My output has to be
ID | Passcode
41-1|10551
4-7|10043
78-3|217631
3-1|19826
12-1|19818912
All records whose 2nd column is less than 5 characters long must be deleted or filtered out. My output file should contain only Passcodes of length 5 characters or more.
It is pretty simple: just use the length() function to keep rows whose second field is at least 5 characters long, after setting the input and output field separators to |:
awk 'BEGIN{FS=OFS="|"} NR==1 || length($2)>=5' file
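A small variant, in case it helps: the threshold can be passed in as a variable (min is just a name I'm introducing here), and since the command only filters and never modifies fields, setting FS alone is enough:
$ awk -v min=5 -F'|' 'NR==1 || length($2)>=min' file
This prints the header line plus exactly the rows shown in the expected output above.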
I'm trying to work out a generic script for getting the tab-separated column values (as a header of the file). Separating via awk is fine for getting the column names, but I don't see how to get each tab-separated value up to the last one (NF, if using awk). The number of columns in the file isn't fixed; sometimes it might come with 20 columns, sometimes 100, etc.
For example, the tab-separated columns in the file are:
abc ttr nnc r32 inc ...
If I write a simple awk as:
head -1 file | awk 'BEGIN {FS="\t"} {print $1, $2, etc}'
It'd print each tab-separated column as $1, $2, etc. I tried an incremental version, replacing $1, $2, etc. with $i, but it wouldn't work.
Any ideas on this?
If I understand correctly, you are asking how to loop over the fields from 1 to NF. Here is an example of such a loop:
$ head -1 file | awk -F"\t" '{for (i=1; i<=NF; i++) printf "%s ", $i; print ""}'
abc ttr nnc r32 inc
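If you want the output to stay tab-separated instead of space-separated, a sketch of the same loop with an explicit separator (a tab between fields, a newline after the last) would be:
$ head -1 file | awk -F'\t' '{for (i=1; i<=NF; i++) printf "%s%s", $i, (i<NF ? "\t" : "\n")}'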
I am searching for a command line that takes a text file and a file with line numbers (one on each line, or alternatively from stdin) and outputs only those lines from the first file.
The text file may be several hundred MB large, and the line list may contain several thousand entries (but they are sorted ascending).
In short:
one file contains data
another file contains indexes
a command should extract only indexed lines
first file:
many lines
of course they are all very different
and contain very important data
...
more lines
...
even more lines
second file:
1
5
7
expected output:
many lines
more lines
even more lines
The second (line-number) file does not necessarily have to exist; its data may also come from stdin (indeed, this would be the optimum). The format of that data may also vary from what is shown, if that would make the task easier.
This can be an approach:
$ awk 'FNR==NR {a[$1]; next} FNR in a' file_with_line_numbers file_with_data
many lines
more lines
even more lines
It reads file_with_line_numbers and stores each line number as an index of the array a[]. Then it reads the other file, checking whether the current line number (FNR) is in the array, in which case the line is printed.
The trick used is the following:
awk 'FNR==NR {something; next} {other things}' file1 file2
that performs actions related to file1 in the {something} block and then actions related to file2 in the {other things} block.
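You can watch this mechanism at work by printing the counters yourself: NR keeps incrementing across all input files, while FNR restarts at 1 for each file, so FNR==NR is true exactly while the first file is being read:
$ awk '{print FILENAME, NR, FNR}' file1 file2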
What if the line numbers are given through stdin?
For this you can use awk '...' - file, where stdin is referred to by -. This is called naming standard input. So you can do:
your_commands | awk 'FNR==NR {a[$1]; next} FNR in a' - file_with_data
Test
$ echo "1
5
7" | awk 'FNR==NR {a[$1]; next} FNR in a' - file_with_data
many lines
more lines
even more lines
With sed, convert the line numbers to a sed program, and use that generated program to print out the wanted lines:
$ sed -n "$( sed 's/$/p/' second_file )" first_file
many lines
more lines
even more lines
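To see what the inner sed generates, run it on its own: it just appends a p (print) command to each line number, yielding a tiny sed program that the outer sed -n then runs against first_file:
$ sed 's/$/p/' second_file
1p
5p
7p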
This works too, though it reruns sed over file1 once per line number, so it will be slow on large files.
foreach line ( `cat file2` )
foreach? sed -n "$line p" file1
foreach? end
many lines
more lines
even more lines
In UNIX, I would like to filter my 3-column file based on the "DP" values within the 3rd column.
I'd like to obtain only rows that have DP values higher than 7.
A|49.14|AC=2;AF=0.500;AN=4;BaseQRankSum=1.380;DP=6;Dels=0.00;
T|290.92|AC=2;AF=1.00;AN=2;DP=8;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;
T|294.75|AC=6;AF=1.00;AN=6;DP=9;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=6;
I'm using "|" here to separate my three columns.
Here's one simple solution:
$ echo "A|49.14|AC=2;AF=0.500;AN=4;BaseQRankSum=1.380;DP=6;Dels=0.00;
AC=6;AF=1.00;AN=6;DP=9;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=6;T|290.92|AC=2;AF=1.00;AN=2;DP=8;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;
MLEAC=6;" \
| awk '{dpVal=$0;sub(/.*DP=/, "", dpVal);sub(/;.*$/,"", dpVal); if (dpVal>7) print}'
output
T|290.92|AC=2;AF=1.00;AN=2;DP=8;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;
T|294.75|AC=6;AF=1.00;AN=6;DP=9;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=6;
This makes a copy of each line ($0), then strips away everything up to and including DP=, and everything from the next ; onward, leaving just the value of DP. That value is tested (the +0 forces a numeric comparison), and if the test succeeds the whole line is printed. (The default action of print in awk is to print the whole line, but you can tell it to print anything you like, maybe print "Found it:" $0, or zillions of variants.)
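For the record, the same extraction can be sketched with awk's match() and substr() instead of two sub() calls, assuming (as in the sample data) that DP= appears exactly once per line:
$ awk 'match($0, /DP=[0-9]+/) && substr($0, RSTART+3, RLENGTH-3)+0 > 7' file
match() sets RSTART and RLENGTH; skipping the three characters of DP= leaves just the digits, and +0 again forces a numeric comparison before testing against 7.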
edit
I would like to keep all the first 53 lines intact and save them as well to my Output.txt file.
Yes, very easy; you're on the right track. With awk it is very easy to have multiple conditions process different parts of a file. Try this:
awk 'FNR <= 53 {print}
FNR > 53 {
vpVal=$0;sub(/.*DP=/, "", dpVal);sub(/;.*$/,"", dpVal)
if (dpVal>7) print
}' File.vcf > Output.txt
(i don't have a file to test with, so let me know if this isn't right).
IHTH
ab.txt
30 abcd
45 sadsad
ac.txt
50 ab
430 sadsad
How can I join these so that I can get:
abc.txt
30 abcd
50 ab
475 sadsad
The strings in the second column are unique, though.
I tried reading line by line from the first txt file, taking each string and searching for it with grep in the second file to get its number, and in the same way reading line by line from the second file to find the strings that are not in the first. But this process is too tiresome and the worst possible way.
Is there a nice procedure for doing this easily with join?
This will work with GNU awk:
awk '{ sum[$2] += $1 } END { for (name in sum) print sum[name] " " name }' ab.txt ac.txt |
sort -n
Explanation:
As each line is read from the files, $1 and $2 hold the first and second fields (that is, the value and the string). The associative array sum uses the strings as its indices, so it accumulates the values associated with each string. At the end of input (after the last file has been read), the END block loops over each string that is an index in sum, printing its total and its name. Finally, sort -n orders the output numerically.
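Run against the sample files from the question, this produces exactly the requested abc.txt content:
$ awk '{ sum[$2] += $1 } END { for (name in sum) print sum[name] " " name }' ab.txt ac.txt | sort -n
30 abcd
50 ab
475 sadsad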