ab.txt
30 abcd
45 sadsad
ac.txt
50 ab
430 sadsad
How can I join these so that I get:
abc.txt
30 abcd
50 ab
475 sadsad
The strings in the second column are unique, though.
I tried reading the first file line by line, taking the string and searching for it with grep in the second file to get its number, and likewise reading the second file line by line to get the strings that are not in the first.
But this process is tiresome and about the worst possible way to do it.
Is there a nice procedure to do this easily with join?
This will work with any POSIX awk, including GNU awk:
awk '{ sum[$2] += $1 } END { for (name in sum) print sum[name] " " name }' a b |
sort -n
Explanation:
As each line is read from the files, the variables $1 and $2 contain the first and second fields (that is, the value and the string). The associative array sum uses the strings as its indices, so it accumulates the values associated with each string. Then, at the end of input (after the last file is read), it loops through the name of each string that's an index in sum, printing its total and its name. Finally, it sorts the output numerically.
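To try the answer end-to-end, the two sample files can be recreated and piped through the same command (file names a and b as used above):

```shell
# Recreate the sample files from the question
printf '30 abcd\n45 sadsad\n' > a
printf '50 ab\n430 sadsad\n' > b

# Sum column 1 per unique string in column 2, then sort numerically
awk '{ sum[$2] += $1 } END { for (name in sum) print sum[name] " " name }' a b |
sort -n
```

This prints the three totals (30 abcd, 50 ab, 475 sadsad) in numeric order.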
I am parsing a SQL script on Unix. If FROM is the first word on a line, merge that line with the previous one; if FROM is the last word on a line, merge it with the next one. E.g.:
A
FROM
B
I want the result as
A FROM B
avoiding any extra spaces and tabs.
Code:
cat A.txt | sed ':a;N;$!ba;s|[Ff][Rr][Oo][Mm][\s\t]*\n|FROM |g;s/\n\s*\t*[Ff][Rr][Oo][Mm]/ FROM/g' >B.txt
Here is one using GNU awk and gensub(). It replaces combinations of spaces, newlines and tabs (carriage returns omitted because of the unix tag) before and after the word FROM. It uses an empty RS as the record separator, meaning that a record ends at an empty line or at end of file.
$ awk 'BEGIN{RS=""}{$0=gensub(/[ \t\n]+(FROM)[ \t\n]+/," \\1 ","g")}1' file
A FROM B
If you just want the word that comes after FROM:
$ awk 'BEGIN{RS=""}{for(i=1;i<=NF;i++)if($i=="FROM")print $(i+1)}' file
B
Both will fail if your query has FROM in the WHERE part values, like:
SELECT * FROM table WHERE variable='DEATH COMES FROM ABOVE';
I am learning awk and I am having a hard time trying to do this:
I have a file, let's name it pattern_file.txt, which contains multiple patterns, one per line. For example, it looks like this:
pattern_file.txt
PATTERN1
PATTERN2
PATTERN3
PATTERN4
I have a second file, containing some text. Let's name it text_file.txt. It looks like this:
text_file.txt
xxxxxxxxPATTERN1xxxxxxx
yyyyyPATTERN2yyyy
zzzzzzzzzPATTERN3zzzzzz
What I am trying to do is: if one of the patterns in pattern_file.txt is present in the current line read from text_file.txt, print the line.
I know how to print a line with awk; what gives me a hard time is using the patterns stored in pattern_file.txt and checking whether one of them is present.
In awk, using index():
awk 'NR==FNR{a[$1];next}{for(i in a)if(index($0,i)) print}' pattern text
xxxxxxxxPATTERN1xxxxxxx
yyyyyPATTERN2yyyy
zzzzzzzzzPATTERN3zzzzzz
Store the patterns in a hash and, for each record, use index() to look for each pattern in the record.
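For fixed strings, the same filtering can also be done without awk; this sketch uses grep's -F (fixed-string) and -f (read patterns from a file) options on the sample files:

```shell
# Recreate the sample files from the question
printf 'PATTERN1\nPATTERN2\nPATTERN3\nPATTERN4\n' > pattern_file.txt
printf 'xxxxxxxxPATTERN1xxxxxxx\nyyyyyPATTERN2yyyy\nzzzzzzzzzPATTERN3zzzzzz\n' > text_file.txt

# -F: patterns are fixed strings, not regexes; -f: take the patterns from a file
grep -F -f pattern_file.txt text_file.txt
```

This prints the same three matching lines as the awk version.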
A variation of James Brown's helpful answer using match(), which also does a regex match and returns the starting index of the matching string:
awk 'FNR==NR{a[$0]; next}{for (i in a) if (match($0,i)) print}' pattern_file.txt text_file.txt
which returns the needed lines.
Printing the RSTART value set by the match() function,
awk 'FNR==NR{a[$0]; next}{for (i in a) if (match($0,i)) print RSTART}' pattern_file.txt text_file.txt
gives this output:
9 # Meaning 'PATTERN1' match started at index 9 in 'xxxxxxxxPATTERN1xxxxxxx'
6
10
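match() also sets RLENGTH alongside RSTART, so the matched text itself can be pulled out with substr(); a small variation on the same command:

```shell
# Recreate the sample files from the question
printf 'PATTERN1\nPATTERN2\nPATTERN3\nPATTERN4\n' > pattern_file.txt
printf 'xxxxxxxxPATTERN1xxxxxxx\nyyyyyPATTERN2yyyy\nzzzzzzzzzPATTERN3zzzzzz\n' > text_file.txt

# RSTART is where the match begins, RLENGTH how long it is;
# substr() extracts exactly the matched part of the line
awk 'FNR==NR{a[$0]; next}{for (i in a) if (match($0,i)) print substr($0,RSTART,RLENGTH)}' pattern_file.txt text_file.txt
```

This prints PATTERN1, PATTERN2 and PATTERN3, one per matching line.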
I have two very large files in Unix, each containing, say, 5 columns but millions of lines.
Ex :
File 1:
abc|def|ghk|ijk|lmn
....
...
File2 :
abc|def|ghk|ijk|123
...
...
My task is to compare the two large files and find the differing columns and rows. For instance, the output would be:
Column-no Row-no File1-word File2-word.
Ex :
5 1 lmn 123
The files are really large. I have heard awk is the fastest way to do file parsing in Unix, and the output can't wait too long.
Can this be done using awk?
paste/awk solution
$ paste -d'|' file1 file2 |
awk -F'|' '{w=NF/2;
for(i=1;i<=w;i++)
if($i!=$(i+w)) printf "%d %d %s %s", NR,i,$i,$(i+w);
print ""}'
1 5 lmn 123
I changed the order; it makes more sense to me to print the line number first, then the field number, but you can change it easily...
Once paste has matched up the lines of the two files, go over the fields of the first half (the first file) and compare each with the corresponding field of the second half (the second file), printing the differences. awk's implicit loop runs over all records (lines). I haven't tested this with large files, but for the awk part it doesn't matter (it works record by record). I'm not sure how eager paste is, but I doubt it will blink.
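A self-contained run of the paste/awk pipeline on the one sample row given in the question (the rest of the files is elided there, so only that row is used):

```shell
# Recreate the single sample row of each file
printf 'abc|def|ghk|ijk|lmn\n' > file1
printf 'abc|def|ghk|ijk|123\n' > file2

# Glue matching lines side by side, then compare field i with field i+w
paste -d'|' file1 file2 |
awk -F'|' '{w=NF/2;
            for(i=1;i<=w;i++)
              if($i!=$(i+w)) printf "%d %d %s %s", NR,i,$i,$(i+w);
            print ""}'
```

This prints "1 5 lmn 123": line 1, field 5 differs.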
In UNIX, I would like to filter my 3-column file based on the DP values within the 3rd column.
I'd like to obtain only rows that have DP values higher than 7.
A|49.14|AC=2;AF=0.500;AN=4;BaseQRankSum=1.380;DP=6;Dels=0.00;
T|290.92|AC=2;AF=1.00;AN=2;DP=8;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;
T|294.75|AC=6;AF=1.00;AN=6;DP=9;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=6;
I'm using "|" here to separate my three columns.
Here's one simple solution
$ echo "A|49.14|AC=2;AF=0.500;AN=4;BaseQRankSum=1.380;DP=6;Dels=0.00;
T|290.92|AC=2;AF=1.00;AN=2;DP=8;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;
T|294.75|AC=6;AF=1.00;AN=6;DP=9;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=6;" \
| awk '{dpVal=$0;sub(/.*DP=/, "", dpVal);sub(/;.*$/,"", dpVal); if (dpVal>7) print}'
output
T|290.92|AC=2;AF=1.00;AN=2;DP=8;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;
T|294.75|AC=6;AF=1.00;AN=6;DP=9;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=6;
This makes a copy of each line ($0), then strips away everything before DP= and everything after the trailing ; of that field, leaving just the value of DP. That value is tested, and if the test is true, the whole line is printed (awk's default print prints the whole line, but you can tell it to print anything you like, e.g. print "Found it:" $0, or zillions of variants).
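As an alternative sketch (not the answer's method), the third |-separated column can be split into key=value pairs so the DP key is tested directly rather than edited out of a copy of the line:

```shell
# Recreate the three sample rows as a file (assumed name file.vcf)
printf 'A|49.14|AC=2;AF=0.500;AN=4;BaseQRankSum=1.380;DP=6;Dels=0.00;\nT|290.92|AC=2;AF=1.00;AN=2;DP=8;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;\nT|294.75|AC=6;AF=1.00;AN=6;DP=9;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=6;\n' > file.vcf

# Split column 3 on ";", then each piece on "=", and compare the DP value numerically
awk -F'|' '{
  n = split($3, kv, ";")
  for (j = 1; j <= n; j++)
    if (split(kv[j], p, "=") == 2 && p[1] == "DP" && p[2]+0 > 7) { print; break }
}' file.vcf
```

This prints the DP=8 and DP=9 rows, same as the sub() approach; it avoids false matches if some other key's value happened to contain the text "DP=".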
edit
I would like to keep all the first 53 lines intact and save them as well to my Output.txt file.
Yes, very easy; you're on the right track. With awk it is very easy to have multiple conditions process different parts of a file. Try this:
awk 'FNR <= 53 {print}
FNR > 53 {
dpVal=$0;sub(/.*DP=/, "", dpVal);sub(/;.*$/,"", dpVal)
if (dpVal>7) print
}' File.vcf > Output.txt
(I don't have a file to test with, so let me know if this isn't right.)
IHTH
Have a data set where the max number of records in one file is ~ 130,000.
Here is a subset of the first file, 1.txt:
CID|UID|Key|sis_URL
1|D000108|RDHQFKQIGNGIED|http://sis.gov/regno=0000870779
1|D000108|RDHQFKQIGNGIED|http://sis.gov/regno=0014992622
1|D000644|RDHQFKQIGNGIED|http://sis.gov/regno=0000870779
1|D000644|RDHQFKQIGNGIED|http://sis.gov/regno=0014992622
1|D002331|RDHQFKQIGNGIED|http://sis.gov/regno=0000870779
1|D002331|RDHQFKQIGNGIED|http://sis.gov/regno=0014992622
11|C024565|WSLDOOZREJYCGB|http://sis.gov/regno=0000107062
13|C009947|PBKONEOXTCPAFI|http://sis.gov/regno=0000120821
13|C009947|PBKONEOXTCPAFI|http://sis.gov/regno=0063697187
Here is a subset of the second file, 2.txt:
CID|bro_URL
11|http://bro.gov/nmbr=0149
13|http://bro.gov/nmbr=0119
I am running gnuwin32 under Windows 7 (64-bit, 8 GB memory), so I need to use double quotes for Windows. The join command is:
join -t"|" -1 1 -2 1 -a1 -a2 -e "NULL" -o "0,1.2,1.3,1.4,2.2" 1.txt 2.txt > 3.txt
Here is the output file, 3.txt.
CID|UID|Key|sis_URL|bro_URL
1|D000108|RDHQFKQIGNGIED|http://sis.gov/regno=0000870779|NULL
1|D000108|RDHQFKQIGNGIED|http://sis.gov/regno=0014992622|NULL
1|D000644|RDHQFKQIGNGIED|http://sis.gov/regno=0000870779|NULL
1|D000644|RDHQFKQIGNGIED|http://sis.gov/regno=0014992622|NULL
1|D002331|RDHQFKQIGNGIED|http://sis.gov/regno=0000870779|NULL
1|D002331|RDHQFKQIGNGIED|http://sis.gov/regno=0014992622|NULL
11|NULL|NULL|NULL|http://bro.gov/nmbr=0149
13|NULL|NULL|NULL|http://bro.gov/nmbr=0119
11|C024565|WSLDOOZREJYCGB|http://sis.gov/regno=0000107062|NULL
13|C009947|PBKONEOXTCPAFI|http://sis.gov/regno=0000120821|NULL
13|C009947|PBKONEOXTCPAFI|http://sis.gov/regno=0063697187|NULL
For CID:11 and CID:13, I am expecting:
11|C024565|WSLDOOZREJYCGB|http://sis.gov/regno=0000107062|http://bro.gov/nmbr=0149
13|C009947|PBKONEOXTCPAFI|http://sis.gov/regno=0000120821|http://bro.gov/nmbr=0119
13|C009947|PBKONEOXTCPAFI|http://sis.gov/regno=0063697187|http://bro.gov/nmbr=0119
Why does the join on CID:11 and CID:13 fail?
Note: before posting this question I ran the subset above and it produced the proper results. When I run the complete set, I get the improper result (a subset of which is shown here).
Any idea why? Any recommended alternative?
When I've completed the join process, my final table will be 15 columns wide. But I'm already stymied at column 4.
Any proposed work-around, such as awk?
As for why join misbehaves: join requires both input files to be sorted on the join field (in the same collation), and on the full data set the CIDs are most likely no longer in sorted order, which makes join silently mispair rows. An awk lookup avoids the sorting requirement entirely. You can try the following command:
awk -f a.awk 2.txt 1.txt > 3.txt
where a.awk is:
BEGIN { FS=OFS="|" }
NR==FNR{
a[$1]=$2
next
}
{
if ($1 in a)
$(NF+1)=a[$1]
else
$(NF+1)="NULL"
print
}
with output:
CID|UID|Key|sis_URL|bro_URL
1|D000108|RDHQFKQIGNGIED|http://sis.gov/regno=0000870779|NULL
1|D000108|RDHQFKQIGNGIED|http://sis.gov/regno=0014992622|NULL
1|D000644|RDHQFKQIGNGIED|http://sis.gov/regno=0000870779|NULL
1|D000644|RDHQFKQIGNGIED|http://sis.gov/regno=0014992622|NULL
1|D002331|RDHQFKQIGNGIED|http://sis.gov/regno=0000870779|NULL
1|D002331|RDHQFKQIGNGIED|http://sis.gov/regno=0014992622|NULL
11|C024565|WSLDOOZREJYCGB|http://sis.gov/regno=0000107062|http://bro.gov/nmbr=0149
13|C009947|PBKONEOXTCPAFI|http://sis.gov/regno=0000120821|http://bro.gov/nmbr=0119
13|C009947|PBKONEOXTCPAFI|http://sis.gov/regno=0063697187|http://bro.gov/nmbr=0119
Explanation
We read the data in 2.txt into the associative array a
The test NR==FNR matches only the first file on the command line, that is, 2.txt
The next statement is important so that the following rules are not executed for 2.txt
The second rule (the one containing the if test) is then executed only for 1.txt, but the information from 2.txt is still available through the array a
If the first field matches a value in the first column of 2.txt, that is: if ($1 in a), then we append that value to the end of the line (NF is the number of fields, i.e. columns, in the current record of 1.txt)
If there is no match, we append the string "NULL"
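The same logic can be run end-to-end on the subset, with the a.awk body inlined on the command line (only a few representative rows of 1.txt are recreated here):

```shell
# Recreate a few rows of the subset files from the question
cat > 1.txt <<'EOF'
CID|UID|Key|sis_URL
1|D000108|RDHQFKQIGNGIED|http://sis.gov/regno=0000870779
11|C024565|WSLDOOZREJYCGB|http://sis.gov/regno=0000107062
13|C009947|PBKONEOXTCPAFI|http://sis.gov/regno=0000120821
EOF
cat > 2.txt <<'EOF'
CID|bro_URL
11|http://bro.gov/nmbr=0149
13|http://bro.gov/nmbr=0119
EOF

# Load 2.txt into array a keyed by CID, then append the lookup (or NULL) to each 1.txt row
awk 'BEGIN{FS=OFS="|"}
     NR==FNR{a[$1]=$2; next}
     {$(NF+1) = ($1 in a) ? a[$1] : "NULL"; print}' 2.txt 1.txt > 3.txt
cat 3.txt
```

The header line also joins, because "CID" is a key in a with value "bro_URL", so the output header comes out as CID|UID|Key|sis_URL|bro_URL. No sorting of either file is needed.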