I have a data set where the maximum number of records in one file is ~130,000.
Here is a subset of the first file, 1.txt:
CID|UID|Key|sis_URL
1|D000108|RDHQFKQIGNGIED|http://sis.gov/regno=0000870779
1|D000108|RDHQFKQIGNGIED|http://sis.gov/regno=0014992622
1|D000644|RDHQFKQIGNGIED|http://sis.gov/regno=0000870779
1|D000644|RDHQFKQIGNGIED|http://sis.gov/regno=0014992622
1|D002331|RDHQFKQIGNGIED|http://sis.gov/regno=0000870779
1|D002331|RDHQFKQIGNGIED|http://sis.gov/regno=0014992622
11|C024565|WSLDOOZREJYCGB|http://sis.gov/regno=0000107062
13|C009947|PBKONEOXTCPAFI|http://sis.gov/regno=0000120821
13|C009947|PBKONEOXTCPAFI|http://sis.gov/regno=0063697187
Here is a subset of the second file, 2.txt:
CID|bro_URL
11|http://bro.gov/nmbr=0149
13|http://bro.gov/nmbr=0119
I am running gnuwin32 under Windows 7, 64-bit, with 8 GB of memory; therefore I need to use double quotes on Windows. The join command is:
join -t"|" -1 1 -2 1 -a1 -a2 -e "NULL" -o "0,1.2,1.3,1.4,2.2" 1.txt 2.txt > 3.txt
Here is the output file, 3.txt:
CID|UID|Key|sis_URL|bro_URL
1|D000108|RDHQFKQIGNGIED|http://sis.gov/regno=0000870779|NULL
1|D000108|RDHQFKQIGNGIED|http://sis.gov/regno=0014992622|NULL
1|D000644|RDHQFKQIGNGIED|http://sis.gov/regno=0000870779|NULL
1|D000644|RDHQFKQIGNGIED|http://sis.gov/regno=0014992622|NULL
1|D002331|RDHQFKQIGNGIED|http://sis.gov/regno=0000870779|NULL
1|D002331|RDHQFKQIGNGIED|http://sis.gov/regno=0014992622|NULL
11|NULL|NULL|NULL|http://bro.gov/nmbr=0149
13|NULL|NULL|NULL|http://bro.gov/nmbr=0119
11|C024565|WSLDOOZREJYCGB|http://sis.gov/regno=0000107062|NULL
13|C009947|PBKONEOXTCPAFI|http://sis.gov/regno=0000120821|NULL
13|C009947|PBKONEOXTCPAFI|http://sis.gov/regno=0063697187|NULL
For CID:11 and CID:13, I am expecting:
11|C024565|WSLDOOZREJYCGB|http://sis.gov/regno=0000107062|http://bro.gov/nmbr=0149
13|C009947|PBKONEOXTCPAFI|http://sis.gov/regno=0000120821|http://bro.gov/nmbr=0119
13|C009947|PBKONEOXTCPAFI|http://sis.gov/regno=0063697187|http://bro.gov/nmbr=0119
Why does the join on CID:11 and CID:13 fail?
Note: before posting this question, I ran the subset above and it produced the proper results. When I run the complete set, I get the improper result (the subset shown here).
Any idea why? Any recommended alternative?
When I've completed the join process, my final table will be 15 columns wide, but I'm already stymied at column 4.
Any proposed work-around, such as awk?
You can try the following command:
awk -f a.awk 2.txt 1.txt > 3.txt
where a.awk is:
BEGIN { FS = OFS = "|" }

# First file on the command line (2.txt): remember the bro_URL for each CID
NR == FNR {
    a[$1] = $2
    next
}

# Second file (1.txt): append the matching bro_URL, or NULL if there is none
{
    if ($1 in a)
        $(NF+1) = a[$1]
    else
        $(NF+1) = "NULL"
    print
}
with output:
CID|UID|Key|sis_URL|bro_URL
1|D000108|RDHQFKQIGNGIED|http://sis.gov/regno=0000870779|NULL
1|D000108|RDHQFKQIGNGIED|http://sis.gov/regno=0014992622|NULL
1|D000644|RDHQFKQIGNGIED|http://sis.gov/regno=0000870779|NULL
1|D000644|RDHQFKQIGNGIED|http://sis.gov/regno=0014992622|NULL
1|D002331|RDHQFKQIGNGIED|http://sis.gov/regno=0000870779|NULL
1|D002331|RDHQFKQIGNGIED|http://sis.gov/regno=0014992622|NULL
11|C024565|WSLDOOZREJYCGB|http://sis.gov/regno=0000107062|http://bro.gov/nmbr=0149
13|C009947|PBKONEOXTCPAFI|http://sis.gov/regno=0000120821|http://bro.gov/nmbr=0119
13|C009947|PBKONEOXTCPAFI|http://sis.gov/regno=0063697187|http://bro.gov/nmbr=0119
Explanation
We read the data in 2.txt into the associative array a.
The test NR==FNR matches only the first file on the command line, that is, 2.txt.
The next statement is important so that the following rules are not executed for 2.txt.
The second rule (the one containing the if test) is then executed only for 1.txt, but the information from 2.txt is still available through the array a.
If the first field matches a value in the first column of 2.txt, that is, if ($1 in a), then we append that value to the end of the line (NF is the number of fields, i.e. columns, on the current line of 1.txt).
If there is no match, we append the string "NULL" instead.
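As for why the original join failed on the full data set: join expects both inputs to be sorted lexically on the join field, and with numeric CIDs such as 1, 11, and 13 a numerically ordered file is not in lexical order (as text, "11" sorts before "2"). A minimal sketch with hypothetical two-column files (not your real data):

```shell
# Tiny files in numeric order, which is NOT lexical order
printf '1|a\n2|b\n11|c\n' > left.txt
printf '2|B\n11|C\n'      > right.txt

# Sorting both inputs lexically on field 1 lets join pair the lines
sort -t'|' -k1,1 left.txt  > left.sorted.txt
sort -t'|' -k1,1 right.txt > right.sorted.txt
join -t'|' -a1 -e NULL -o '0,1.2,2.2' left.sorted.txt right.sorted.txt
# prints:
# 1|a|NULL
# 11|c|C
# 2|b|B
```

On Windows/gnuwin32 the same idea applies, with double quotes instead of single quotes.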
Related
I am trying to write a simple foreach loop that lists which files are present and which are not in a directory. For this example, let's pretend we have a directory with two files called def_456 and ghi_789. If I were to do something like this:
foreach f (`xargs`)
abc
def
ghi
foreach$ ls /path/to/dir/* | grep $f
foreach$ end
I would expect it to spit out def_456 and ghi_789, since this directory does not have a file called abc_123. Is there a way to have it let me know that abc_123 is missing? Is there a flag I can use with grep, or something else, that will tell me which items in my foreach list were not present?
Thanks!
Edited to hopefully make this question clearer, and to remove the bash tag and add the sh tag. I hope this is correct. As you can all tell, I am quite a noob when it comes to this stuff.
It's not grep, but the comm utility might be useful. It compares two sorted files line by line. If you have the list of files you're looking for in a file called a, and the list of files that were actually there in b, you can do comm a b:
With no options, [comm produces] three-column output. Column one contains lines unique to FILE1, column two contains lines unique to FILE2, and column three contains lines common to both files.
If you hide two of these columns, you end up with a plain listing.
comm -12 a b: only print entries the files have in common (hide columns one and two)
comm -13 a b: only print entries unique to the second file
comm -23 a b: only print entries unique to the first file (the missing files, in your example)
Of course, you can use Bash <(command) syntax to read from commands instead of real files. It should work the same, and you can easily tack on a | sort if need be.
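For instance, assuming the names have been reduced to their common form and both lists are sorted (file names want and have are illustrative):

```shell
# 'want' is what we expect to find; 'have' is what actually exists
printf 'abc\ndef\nghi\n' > want
printf 'def\nghi\n'      > have
comm -23 want have   # lines only in 'want', i.e. the missing entries
# prints: abc
```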
The return code of grep will tell you: if $? after grep is 0, then what you grepped for was present; if it is 1, it was not.
So
foreach f (`xargs`)
abc
def
ghi
foreach$ ls /path/to/dir/* | grep -q $f
foreach$ test $? -eq 1 && echo $f
foreach$ end
will invert the output: it should print what is not found. Not sure, though, if that is what you wanted...
I have a query wherein I take the value from the second column of a file and search whether it is present in another two files.
If the search succeeds, I execute one set of commands; if it fails, another set.
After that, the loop has to go back to the second column and continue searching from the next row. How do I do this?
I have tried the below:
for i in $(cat file1.txt) ; do grep $i *ary.txt ; done
This is not working for me. If I simply extract the column-2 contents into a file first, then the grep works and gives me the output.
What am I missing?
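One hedged guess at the missing piece: the for loop above feeds grep whole lines of file1.txt, not just the second column. A sketch, assuming whitespace-separated columns (the file names here are illustrative):

```shell
# Extract column 2 first, then grep for each value in the other files
for i in $(awk '{print $2}' file1.txt); do
    grep "$i" *ary.txt
done
```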
In UNIX, I would like to filter my 3-column file based on the DP value within the 3rd column.
I'd like to obtain only rows that have DP values higher than 7.
A|49.14|AC=2;AF=0.500;AN=4;BaseQRankSum=1.380;DP=6;Dels=0.00;
T|290.92|AC=2;AF=1.00;AN=2;DP=8;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;
T|294.75|AC=6;AF=1.00;AN=6;DP=9;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=6;
I'm using "|" here to separate my three columns.
Here's one simple solution:
$ echo "A|49.14|AC=2;AF=0.500;AN=4;BaseQRankSum=1.380;DP=6;Dels=0.00;
T|290.92|AC=2;AF=1.00;AN=2;DP=8;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;
T|294.75|AC=6;AF=1.00;AN=6;DP=9;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=6;" \
| awk '{dpVal=$0; sub(/.*DP=/, "", dpVal); sub(/;.*$/, "", dpVal); if (dpVal>7) print}'
output
T|290.92|AC=2;AF=1.00;AN=2;DP=8;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;
T|294.75|AC=6;AF=1.00;AN=6;DP=9;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=6;
This makes a copy of each line ($0), then strips away everything up to and including DP=, and everything from the trailing ; onward for that field, leaving just the value of DP. That value is tested, and if the test is true the whole line is printed. (The default action of awk's print is to print the whole line, but you can tell it to print anything you like, e.g. print "Found it:" $0, or countless variants.)
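An equivalent way to isolate the DP value is to split each line on ; and look for the DP= pair, instead of stripping with sub() (the sample lines here are shortened for illustration):

```shell
printf 'A|49.14|AC=2;DP=6;Dels=0.00;\nT|290.92|AC=2;DP=8;FS=0.000;\n' |
awk -F';' '{
    for (i = 1; i <= NF; i++)       # scan the ;-separated pieces
        if ($i ~ /^DP=/) {          # found the DP=<n> pair
            split($i, kv, "=")
            if (kv[2] > 7) print    # keep lines with DP above 7
        }
}'
# prints: T|290.92|AC=2;DP=8;FS=0.000;
```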
edit
I would like to keep all the first 53 lines intact and save them as well to my Output.txt file.
Yes, very easy; you're on the right track. With awk it is very easy to have multiple conditions process different parts of a file. Try this:
awk 'FNR <= 53 {print}
FNR > 53 {
    dpVal=$0; sub(/.*DP=/, "", dpVal); sub(/;.*$/, "", dpVal)
    if (dpVal>7) print
}' File.vcf > Output.txt
(I don't have a file to test with, so let me know if this isn't right.)
IHTH
ab.txt
30 abcd
45 sadsad
ac.txt
50 ab
430 sadsad
How can I join these so that I get:
abc.txt
30 abcd
50 ab
475 sadsad
The strings in the second column are unique, though.
I tried reading line by line from the first txt file, taking the string, searching for it with grep in the second file, and pulling the number from the match; and in the same way reading line by line from the second file to get the strings that are not in the first.
But this process is tiresome and just about the worst possible way.
Is there a nicer procedure that does it easily, perhaps with join?
This will work with GNU awk:
awk '{ sum[$2] += $1 } END { for (name in sum) print sum[name] " " name }' ab.txt ac.txt |
sort -n
Explanation:
As each line is read from the files, $1 and $2 contain the first and second fields (that is, the value and the string). The associative array sum uses the strings as its indices, so it accumulates the values associated with each string. At the end of input (after the last file is read), the END block loops through each string that is an index in sum, printing its total and its name. Finally, sort -n sorts the output numerically.
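Using the sample data and file names from the question, the whole thing can be run as:

```shell
printf '30 abcd\n45 sadsad\n' > ab.txt
printf '50 ab\n430 sadsad\n'  > ac.txt
awk '{ sum[$2] += $1 } END { for (name in sum) print sum[name] " " name }' ab.txt ac.txt |
sort -n
# prints:
# 30 abcd
# 50 ab
# 475 sadsad
```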
We need a shell script that retrieves all txt files in the current directory and, for each file, checks whether it is empty or contains any data (which I believe can be done with the wc command).
If it is empty, ignore it. Otherwise, in our situation, all txt files in this directory are either empty or contain a lot of data, where the last line of the file looks like this:
Z|11|21||||||||||
That is, the last line has the character Z, then |, then an integer, then |, then an integer, then a certain number of | symbols.
If the file is not empty, we just assume it has this format. The data before the last line are garbled and not needed by us, but there will be at least one line before the last line; i.e., a non-empty file is guaranteed to have at least two lines.
We need code wherein, if the file is non-empty, the script takes the file, replaces the 'Z' in the last line with the file's name, and writes the new data into another file, say tempfile. The last line thus becomes:
filename.txt|11|21|||||||
The rest of the line stays the same. From tempfile, the last line, i.e. filename.txt|int|int|||||, is taken out and merged into finalfile. The contents of tempfile are then cleared to receive data from the next txt file in the same directory. finalfile ends up holding the edited last lines of all non-empty txt files in that directory.
Eg: file1.txt has data as
....
....
....
Z|1|1|||||
and file2.txt has data as
....
....
....
Z|2|34|||||
After running the script, new data of file1.txt becomes
.....
.....
.....
file1.txt|1|1||||||
This will be written into a new file say temp.txt which is initially empty. From there the last line is merged into a file final.txt. So, the data in final.txt is:
file1.txt|1|1||||||
After this merging, the data in temp.txt is cleared
New data of file2.txt becomes
...
...
...
file2.txt|2|34||||||
This will be written into the same file temp.txt. From there the last line is merged into the same file final.txt.
So, the data in final.txt is
file1.txt|1|1||||||
file2.txt|2|34||||||
After considering N number of files that was returned to be as of type txt and non-empty and within the same directory, the data in final.txt becomes
file1.txt|1|1||||||
file2.txt|2|34||||||
file3.txt|8|3||||||
.......
.......
.......
fileN.txt|22|3|||||
For some of the conditions, I already know the command. For example:
For finding text files in a directory:
find <directory> -type f -name "*.txt"
For taking the last line and merging it into another file:
tail -1 file.txt>>destination.txt
You can use sed to replace the 'Z' character. You'll be in a loop, so you can use the filename you have there. This just replaces the Z with the filename and echoes the resulting line.
Good luck.
#!/bin/bash
filename=test.txt
line=`tail -1 "$filename" | sed "s/Z/$filename/"`
echo "$line"
Edit:
Did you run your find command first and look at the output? Each line starts with ./. That will break the sed command, since sed is using / as its delimiter and the filename gets substituted into the expression. It also will not match your problem statement, which does not have an extra ./ before the filename. And you said the current directory, but the find command you give will traverse ALL subdirectories. Try keeping it simple and using ls:
# `2>/dev/null` sends stderr to null instead of the screen; this stops
# us from getting the "no files found" error and thinking it's a file!
for filename in `ls *.txt 2>/dev/null` ; do
    ... stuff ...
done
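Putting the pieces together, here is a hedged sketch of the whole loop. Note two assumptions: it appends to final.out rather than a .txt file, so the glob never picks up its own output, and it uses ',' as the sed delimiter so characters in the filename cannot clash with it:

```shell
#!/bin/bash
> final.out                      # start with an empty results file
for f in *.txt; do
    [ -s "$f" ] || continue      # skip empty (or missing) files
    # replace the leading Z of the last line with the file's name
    tail -1 "$f" | sed "s,^Z,$f," >> final.out
done
```

For two sample files ending in Z|1|1||||| and Z|2|34|||||, final.out would contain file1.txt|1|1||||| and file2.txt|2|34|||||.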