AWK from a second column and grep it in 2 files - unix

I have a query wherein I am taking the value from the second column of a file and searching whether it is present in another 2 files.
If it is present, then I execute one set of commands; if it is not, then there is another set of commands.
After execution, the loop has to go to the second column of the next row and search again. How do I do this?
I have tried the below:
for i in $(cat file1.txt) ; do grep $i *ary.txt ; done
This is not working for me. If I first extract the column 2 contents into a separate file, then the grep works and gives me the output.
What am I missing?
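A rough sketch of one way to wire this up (untested; the echo lines are just placeholders for the two sets of commands):
awk '{print $2}' file1.txt | while read -r val; do
    if grep -q "$val" *ary.txt; then
        echo "$val present"        # run the 'found' set of commands here
    else
        echo "$val absent"         # run the 'not found' set of commands here
    fi
done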


Calling a function from awk with variable input location

I have a bunch of different files. We have used "|" as the delimiter. All files contain a column titled CARDNO, but not necessarily in the same location in all of the files. I have a function called data_mask. I want to apply it to CARDNO in all of the files to change them into NEWCARDNO.
I know that if I pass in the column number of CARDNO I can do this pretty simply, say it's the 3rd column in a 5 column file with something like:
awk -v column=$COLNUMBER '{print $1, $2, FUNCTION($column), $4, $5}' FILE
However, if all of my files have hundreds of columns and it's somewhere arbitrary in each file, this is incredibly tedious. I am looking for a way to do something along the lines of this:
awk -v column=$COLNUMBER '{print #All columns before $column, FUNCTION($column), #All columns after $column}' FILE
My function takes a string as input and changes it into a new one. It takes the value of the column as input, not the column number. Please suggest a Unix command which can pass the column value to the function and give the desired output.
Thanks in advance
If I understand your problem correctly, the first row of the file is the header and one of those columns is named CARDNO. If this is the case then you just search for the header in that file and process accordingly.
awk 'BEGIN{FS=OFS="|";c=1}
(NR==1){while($c != "CARDNO" && c<=NF) c++
if(c>NF) exit
$c="NEWCARDNO" }
(NR!=1){$c=FUNCTION($c)}
{print}' <file>
As per the comment, if there is no header in the file but you know, per file, which column number it is, then you can simply do:
awk -v c=$column 'BEGIN{FS=OFS="|"}{$c=FUNCTION($c)}1' <file>
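To make that concrete, here is a minimal sketch of the header-detection variant with a stand-in masking function (data_mask here just blanks out all but the last four characters; file_with_header.txt and the masking rule are assumptions, not part of the original question):
awk 'BEGIN{FS=OFS="|"; c=1}
     function data_mask(s) {                        # stand-in for the real masking logic
         return "************" substr(s, length(s)-3)
     }
     NR==1 { while ($c != "CARDNO" && c <= NF) c++  # find the CARDNO column
             if (c > NF) exit                       # no CARDNO column at all
             $c = "NEWCARDNO" }                     # rename the header
     NR>1  { $c = data_mask($c) }                   # mask every value in that column
     { print }' file_with_header.txt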

Trying to have grep tell me which values are not present, from a foreach loop

I am trying to write a simple foreach loop that would list which files are present and which are not in a directory. For this example, let's pretend we have a directory with two files called def_456 and ghi_789. If I were to do something like this:
foreach f (`xargs`)
abc
def
ghi
foreach$ ls /path/to/dir/* | grep $f
foreach$ end
I would expect it to spit out def_456 ghi_789 since this directory does not have a file called abc_123. Is there a way to have it let me know that abc_123 is missing? Is there a flag I can use with grep, or something else, that will tell me which entries in my foreach list were not present?
Thanks!
Edited to hopefully make this question more clear and to remove the bash tag and add the sh tag. I hope this is correct. As you can all tell I am quite a noob when it comes to this stuff.
It's not grep, but the comm utility might be useful. It compares two sorted files line by line. If you have your list of files that you're looking for in a file called a, and the list of things that were actually there in b, you could do comm a b:
With no options, [comm produces] three-column output. Column one contains lines unique to FILE1, column two contains lines unique to FILE2, and column three contains lines common to both files.
If you hide two of these columns, you end up with a plain listing.
comm -12 a b: Only print entries the files had in common (hide columns one and two)
comm -13 a b: Only print entries unique to the second file (the names actually present in the directory, in your example)
comm -23 a b: Only print entries unique to the first file (abc in your example, assuming the _456-style suffixes are stripped so the names are comparable)
Of course, you can use Bash <(command) syntax to read from commands instead of real files. It should work the same, and you can easily tack on a | sort if need be.
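For this particular question, a rough sketch of that (assuming a hypothetical want.txt holding abc, def, ghi one per line, and that stripping the _456-style suffix makes the directory names comparable):
# names in your list that were not found in the directory
comm -23 <(sort want.txt) <(ls /path/to/dir | sed 's/_.*//' | sort)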
The return code of grep will tell you: if the exit status after grep is 0, then what you grepped for was present; if it is 1, it was not. (In csh/tcsh the status lives in $status rather than $?.)
So
foreach f (`xargs`)
abc
def
ghi
foreach$ ls /path/to/dir/* | grep -q $f
foreach$ test $status -eq 1 && echo $f
foreach$ end
will invert the output: it should print only what is not found. Not sure, though, if that is what you wanted....
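And since the question was retagged to sh, a rough plain-sh equivalent of the same idea (list.txt is a hypothetical file holding abc, def, ghi one per line):
while read -r f; do
    ls /path/to/dir | grep -q "$f" || echo "$f is missing"
done < list.txt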

unix cut to extract column from text file and save rest of the contents to a new file

I can do the following using unix cut:
cut -f 1 myfile.out
Output:
6DKK463WXXK
VKFQ9PYP9CG
Since it's printing out the column that I want to extract, how do I create a new file without this column? In other words, I want to remove this column and keep the rest of the content.
Depending on your version of cut, you may be able to use the --complement option (a GNU extension) to select the fields not listed.
cut -f 2 --complement myfile.input > myfile.output
That will place all the columns from the input file into the output file, except for column 2.
You use the -d argument to specify a delimiter other than tab, which is the default.
Note from experience: Be careful with the > especially when using similar names for input and output so that you don't accidentally overwrite your input file (using tab completion, this is easy to do).
Example:
% echo one two three | cut -d ' ' -f 2 --complement
> one three
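As a small side note (a sketch, not part of the answer above): if your cut doesn't have --complement, an open-ended field range handles the original case of dropping the first column:
cut -f 2- myfile.out > myfile_nocol1.out    # keep field 2 through the last field; output name is just an example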
This is very easy in awk
echo "1 2 3 4 5" | awk -F" " '{sub(/[^ ]+ /,""); print}'
output
2 3 4 5
This deletes everything up to and including the first space character (i.e. the first field and its trailing space).
The remaining record is then printed.
IHTH

Filter file based on internal value within a column

In UNIX, I would like to filter my 3-column file based on the "DP" values within the 3rd column.
I'd like to obtain only rows that have DP values higher than 7.
A|49.14|AC=2;AF=0.500;AN=4;BaseQRankSum=1.380;DP=6;Dels=0.00;
T|290.92|AC=2;AF=1.00;AN=2;DP=8;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;
T|294.75|AC=6;AF=1.00;AN=6;DP=9;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=6;
I'm using "|" to separate my three columns.
Here's one simple solution
$ echo "A|49.14|AC=2;AF=0.500;AN=4;BaseQRankSum=1.380;DP=6;Dels=0.00;
AC=6;AF=1.00;AN=6;DP=9;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=6;T|290.92|AC=2;AF=1.00;AN=2;DP=8;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;
MLEAC=6;" \
| awk '{dpVal=$0;sub(/.*DP=/, "", dpVal);sub(/;.*$/,"", dpVal); if (dpVal>7) print}'
output
T|290.92|AC=2;AF=1.00;AN=2;DP=8;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;
T|294.75|AC=6;AF=1.00;AN=6;DP=9;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=6;
This makes a copy of each line ($0), then strips away everything before DP= and everything after the trailing ; char for that field, leaving just the value for DP. That value is tested (the +0 forces a numeric rather than string comparison), and if the test is true the whole line is printed (the default action of awk's print is to print the whole line, but you can tell it to print anything you like, maybe print "Found it:" $0 or zillions of variants).
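An equivalent way to get at DP, as a sketch, is to split the third column on ";" instead of editing a copy of the whole line (input.txt stands in for the actual file name):
awk -F"|" '{
    dp = 0                            # reset for every line
    n = split($3, kv, ";")            # break the third column into key=value pieces
    for (i = 1; i <= n; i++)
        if (kv[i] ~ /^DP=/)
            dp = substr(kv[i], 4) + 0 # numeric value after "DP="
    if (dp > 7) print                 # keep rows with DP above 7
}' input.txt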
edit
I would like to keep all the first 53 lines intact and save them as well to my Output.txt file.
Yes, very easy, you're on the right track. With awk it is very easy to have multiple conditions process different parts of a file. Try this:
awk 'FNR <= 53 {print}
     FNR > 53 {
         dpVal=$0; sub(/.*DP=/, "", dpVal); sub(/;.*$/, "", dpVal)
         if (dpVal+0 > 7) print
     }' File.vcf > Output.txt
(I don't have a file to test with, so let me know if this isn't right.)
IHTH

Unix join produces inconsistent results on Windows 7

I have a data set where the max number of records in one file is ~130,000.
Here is a subset of the first file, 1.txt:
CID|UID|Key|sis_URL
1|D000108|RDHQFKQIGNGIED|http://sis.gov/regno=0000870779
1|D000108|RDHQFKQIGNGIED|http://sis.gov/regno=0014992622
1|D000644|RDHQFKQIGNGIED|http://sis.gov/regno=0000870779
1|D000644|RDHQFKQIGNGIED|http://sis.gov/regno=0014992622
1|D002331|RDHQFKQIGNGIED|http://sis.gov/regno=0000870779
1|D002331|RDHQFKQIGNGIED|http://sis.gov/regno=0014992622
11|C024565|WSLDOOZREJYCGB|http://sis.gov/regno=0000107062
13|C009947|PBKONEOXTCPAFI|http://sis.gov/regno=0000120821
13|C009947|PBKONEOXTCPAFI|http://sis.gov/regno=0063697187
Here is a subset of the second file, 2.txt:
CID|bro_URL
11|http://bro.gov/nmbr=0149
13|http://bro.gov/nmbr=0119
I am running GnuWin32 under Windows 7 (64-bit, 8 GB memory); therefore I need to use double quotes for Windows. The join command is:
join -t"|" -1 1 -2 1 -a1 -a2 -e "NULL" -o "0,1.2,1.3,1.4,2.2" 1.txt 2.txt > 3_.txt
Here is the output file, 3.txt.
CID|UID|Key|sis_URL|bro_URL
1|D000108|RDHQFKQIGNGIED|http://sis.gov/regno=0000870779|NULL
1|D000108|RDHQFKQIGNGIED|http://sis.gov/regno=0014992622|NULL
1|D000644|RDHQFKQIGNGIED|http://sis.gov/regno=0000870779|NULL
1|D000644|RDHQFKQIGNGIED|http://sis.gov/regno=0014992622|NULL
1|D002331|RDHQFKQIGNGIED|http://sis.gov/regno=0000870779|NULL
1|D002331|RDHQFKQIGNGIED|http://sis.gov/regno=0014992622|NULL
11|NULL|NULL|NULL|http://bro.gov/nmbr=0149
13|NULL|NULL|NULL|http://bro.gov/nmbr=0119
11|C024565|WSLDOOZREJYCGB|http://sis.gov/regno=0000107062|NULL
13|C009947|PBKONEOXTCPAFI|http://sis.gov/regno=0000120821|NULL
13|C009947|PBKONEOXTCPAFI|http://sis.gov/regno=0063697187|NULL
For CID:11 and CID:13, I am expecting:
11|C024565|WSLDOOZREJYCGB|http://sis.gov/regno=0000107062|http://bro.gov/nmbr=0149
13|C009947|PBKONEOXTCPAFI|http://sis.gov/regno=0000120821|http://bro.gov/nmbr=0119
13|C009947|PBKONEOXTCPAFI|http://sis.gov/regno=0063697187|http://bro.gov/nmbr=0119
Why does the join on CID:11 and CID:13 fail?
Note: before posting this question I ran the subset above and produced the proper results. When I run the complete set, I get the improper result (the subset shown here).
Any idea why? Any recommended alternative?
When I've completed the join process, my final table will be 15 columns wide. But I'm already stymied at column 4.
Any proposed work-around, such as awk?
You can try the following command:
awk -f a.awk 2.txt 1.txt > 3.txt
where a.awk is:
BEGIN { FS=OFS="|" }
NR==FNR {
    a[$1]=$2
    next
}
{
    if ($1 in a)
        $(NF+1)=a[$1]
    else
        $(NF+1)="NULL"
    print
}
with output:
CID|UID|Key|sis_URL|bro_URL
1|D000108|RDHQFKQIGNGIED|http://sis.gov/regno=0000870779|NULL
1|D000108|RDHQFKQIGNGIED|http://sis.gov/regno=0014992622|NULL
1|D000644|RDHQFKQIGNGIED|http://sis.gov/regno=0000870779|NULL
1|D000644|RDHQFKQIGNGIED|http://sis.gov/regno=0014992622|NULL
1|D002331|RDHQFKQIGNGIED|http://sis.gov/regno=0000870779|NULL
1|D002331|RDHQFKQIGNGIED|http://sis.gov/regno=0014992622|NULL
11|C024565|WSLDOOZREJYCGB|http://sis.gov/regno=0000107062|http://bro.gov/nmbr=0149
13|C009947|PBKONEOXTCPAFI|http://sis.gov/regno=0000120821|http://bro.gov/nmbr=0119
13|C009947|PBKONEOXTCPAFI|http://sis.gov/regno=0063697187|http://bro.gov/nmbr=0119
Explanation
We read the data in 2.txt into the associative array a
The test NR==FNR is used to match only the first file on the command line, that is the file 2.txt
The next statement is important so that the next rules are not executed for 2.txt
The second rule (the one containing the if test) is then executed only for 1.txt but the information from 2.txt is still available through the array a
If the first field matches a key in the array, that is: if ($1 in a), then we append the corresponding value from 2.txt as a new last field (NF is the number of fields, that is columns, read from 1.txt, so $(NF+1) adds one at the end)
If there is no match, we append the string "NULL" instead
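As an aside on the original failure: join expects both inputs to be sorted lexically on the join field, which is probably why the small subset (CIDs 1, 11, 13, already in lexical order) behaves while the full 130,000-record files do not. A rough sketch of that route, leaving header handling aside:
# sort both files lexically on the first |-separated field, then join
# (if join still complains about ordering, set LC_ALL=C for both sort and join)
sort -t"|" -k1,1 1.txt > 1.sorted.txt
sort -t"|" -k1,1 2.txt > 2.sorted.txt
join -t"|" -1 1 -2 1 -a1 -a2 -e "NULL" -o "0,1.2,1.3,1.4,2.2" 1.sorted.txt 2.sorted.txt > 3.txt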
