Calling a function from awk with variable input location - unix

I have a bunch of different files. We have used "|" as the delimiter. All files contain a column titled CARDNO, but not necessarily in the same location in every file. I have a function called data_mask that I want to apply to CARDNO in all of the files to change it into NEWCARDNO.
I know that if I pass in the column number of CARDNO I can do this pretty simply; say it's the 3rd column in a 5-column file, with something like:
awk -v column=$COLNUMBER '{print $1, $2, FUNCTION($column), $4, $5}' FILE
However, if all of my files have hundreds of columns and it's somewhere arbitrary in each file, this is incredibly tedious. I am looking for a way to do something along the lines of this:
awk -v column=$COLNUMBER '{print #All columns before $column, FUNCTION($column), #All columns after $column}' FILE
My function takes a string as input and changes it into a new one. It takes the value of the column as input, not the column number. Please suggest a Unix command that can pass the column value to the function and give the desired output.
Thanks in advance

If I understand your problem correctly, the first row of the file is the header and one of those columns is named CARDNO. If this is the case then you just search for the header in that file and process accordingly.
awk 'BEGIN{FS=OFS="|";c=1}
(NR==1){while($c != "CARDNO" && c<=NF) c++
if(c>NF) exit
$c="NEWCARDNO" }
(NR!=1){$c=FUNCTION($c)}
{print}' <file>
As per comment, if there is no header in the file, but you know per file, which column number it is, then you can simply do:
awk -v c=$column 'BEGIN{FS=OFS="|"}{$c=FUNCTION($c)}1' <file>
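As a runnable sketch of the header-scanning version, with a toy mask() function standing in for the real data_mask (which is not shown in the question; here it just keeps the last 4 characters and replaces the rest with X):

```shell
# Toy input: CARDNO happens to be the 2nd column, but the script
# discovers that by scanning the header row.
printf 'NAME|CARDNO|CITY\njohn|1234567890|paris\n' > cards.txt

awk 'BEGIN{FS=OFS="|"; c=1}
  # mask() is a stand-in for the real data_mask function.
  function mask(s,  r,i){ for(i=1;i<=length(s)-4;i++) r=r "X"; return r substr(s,length(s)-3) }
  NR==1{ while($c!="CARDNO" && c<=NF) c++; if(c>NF) exit; $c="NEWCARDNO" }
  NR!=1{ $c=mask($c) }
  1' cards.txt
```

This prints NAME|NEWCARDNO|CITY followed by john|XXXXXX7890|paris; swapping mask() for the real function is the only change needed.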

Related

How to select multiple columns which are not next to each other?

I have a dataset from which I am trying to select the first 10 columns and the last 27 columns (from the 125th column onwards to the final, 152nd column).
awk 'BEGIN{FS="\t"} { printf $1,$2,$3,$4,$5,$6,$7,$8,$9,$10; for(i=125; i<=NF; ++i) printf $i""FS; print ""}' Bigdata.txt > Smalldata.txt
When I try this code it gives me the first 12 columns (with their data) and all the headers for all 152 columns from my original big data file. How do I select both columns 1-10 and 125-152 to go into a new file? I am new to Linux and any guidance would be appreciated.
Don't reinvent the wheel: if you already know the column numbers, cut is the tool for this task.
$ cut -f1-10,125-152 bigdata
tab is the default delimiter.
If you don't know the number of columns, awk comes to the rescue!
$ cut -f1-10,$(awk '{print NF-27"-"NF; exit}' file) file
awk will print the end range by reading the first line of the file.
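At a smaller scale the same pattern can be checked end to end; here a 5-field file stands in for the real data, with NF-1 playing the role of NF-27:

```shell
# One tab-separated record with 5 fields.
printf 'a\tb\tc\td\te\n' > data.tsv

# awk computes the trailing field range from the first line; cut does the cutting.
cut -f1,$(awk '{print NF-1 "-" NF; exit}' data.tsv) data.tsv
# → a	d	e
```

Note the file is read twice: once by awk to build the range, once by cut.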
Using the KISS principle
awk 'BEGIN{FS=OFS="\t"}
{ c=""; for(i=1;i<=10;++i)   { printf "%s%s", c, $i; c=OFS }
  for(i=NF-27;i<=NF;++i) { printf "%s%s", c, $i }
  printf ORS }' file
Could you please try the following; since no samples were provided, I couldn't test it. You need not write the field values 1..10 manually: you can use a loop for that too.
awk 'BEGIN{FS=OFS="\t"}{for(i=1;i<=10;i++){printf("%s%s",$i,OFS)};for(i=(NF-27);i<=NF;i++){printf("%s%s",$i,i==NF?ORS:OFS)}}' Input_file > output_file
Also, you need not worry about headers here: since we simply print every line and apply no line-specific logic, no special handling of the first line is needed.
EDIT: One more point: it seems you want the values from both ranges to appear on a single output line for each input line. If so, the code above already handles it, since it prints only the separator between values and emits a newline only after the last field, so all selected fields from one input line stay on one output line (matching the input file's entries).
Explanation: Adding detailed explanation here.
awk ' ##Starting awk program here.
BEGIN{ ##Starting BEGIN section here, which will be executed before Input_file is getting read.
FS=OFS="\t" ##Setting FS and OFS as TAB here.
} ##Closing BEGIN section here for this awk code.
{ ##Starting a new BLOCK which will be executed when Input_file is being read.
for(i=1;i<=10;i++){ ##Running a for loop which will run 10 times from i=1 to i=10 value.
printf("%s%s",$i,OFS) ##Printing value of specific field with OFS value.
} ##Closing for loop BLOCK here.
for(i=(NF-27);i<=NF;i++){ ##Starting a for loop which will run for 27 last fields only as per OP requirements.
printf("%s%s",$i,i==NF?ORS:OFS) ##Printing the field value; if it is the last field of the line, print a newline after it, else print the OFS.
} ##Closing block for, for loop now.
}' Input_file > output_file ##Mentioning Input_file name here, whose output is going into output_file.
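Scaled down so it can be run on a single toy record (keeping the first 2 fields and the last 2 of 5, so 2 stands in for 10 and NF-1 for NF-27), the loop version behaves like this:

```shell
printf 'a\tb\tc\td\te\n' |
awk 'BEGIN{FS=OFS="\t"}
     { for(i=1;i<=2;i++)     printf("%s%s",$i,OFS)            # leading range
       for(i=NF-1;i<=NF;i++) printf("%s%s",$i,i==NF?ORS:OFS)  # trailing range
     }'
# → a	b	d	e
```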

awk - Find duplicate entries in 2 columns, keep 1 duplicate and unique entries

I need to find duplicate entries across 2 different columns and keep only one of each duplicate pair, plus all unique entries. For me, if A123 is in the first column and it shows up later in the second column, it's a duplicate. I also know for sure that A123 will always be paired to B123, either as A123,B123 or B123,A123. I only need to keep one, and it doesn't matter which one it is.
Ex: My input file would contain:
A123,B123
A234,B234
C123,D123
B123,A123
B234,A234
I'd like the output to be:
A123,B123
A234,B234
C123,D123
The best I can do is to extract the unique entries with :
awk -F',' 'NR==FNR{x[$1]++;next}; !x[$2]' file1 file1
or get only the duplicates with
awk -F',' 'NR==FNR{x[$1]++;next}; x[$2]' file1 file1
Any help would be greatly appreciated.
This can be shorter!
First print the line if the second field is not yet present in the array; then add the first field to the array. Only one pass over the input file is necessary:
awk -F, '!x[$2];{x[$1]++}' file1
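Piping the sample input from the question through it confirms the result:

```shell
printf 'A123,B123\nA234,B234\nC123,D123\nB123,A123\nB234,A234\n' |
awk -F, '!x[$2]; {x[$1]++}'
# → A123,B123
#   A234,B234
#   C123,D123
```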
This awk one-liner works for your example:
awk -F, '!($2 in a){a[$1]=$0}END{for(x in a)print a[x]}' file
The conventional, idiomatic awk solution:
$ awk -F, '!seen[$1>$2 ? $1 : $2]++' file
A123,B123
A234,B234
C123,D123
By convention we always use seen (rather than x or anything else) as the array name when it represents a set where you want to check whether its index has been seen before, and using a ternary expression to produce the larger of the two possible key values as the index ensures the order they appear in the input doesn't matter.
The above doesn't care about your unique situation where every $2 is paired to a specific $1 - it simply prints unique individual occurrences across a pair of fields. If you wanted it to work on the pair of fields combined (and assuming you have more fields so just using $0 as the index wouldn't work) that'd be:
awk -F, '!seen[$1>$2 ? $1 FS $2 : $2 FS $1]++' file
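Run against the sample data, the pair-keyed version gives the same three lines, regardless of which ordering of each pair appears first:

```shell
printf 'A123,B123\nA234,B234\nC123,D123\nB123,A123\nB234,A234\n' |
awk -F, '!seen[$1>$2 ? $1 FS $2 : $2 FS $1]++'
# → A123,B123
#   A234,B234
#   C123,D123
```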

Search list of ids in a log file in unix

Sorry to ask this; it might be a trivial question. I tried an awk script as well, but I am new to it.
I have a list of IDs in a file, ids.txt:
1xre23
223dsf
234ewe
and a log file with FIX messages which might contain those ids.
sample: log file abc.log
35=D^A54=1xre23^A22=s^A120=GBP^A
35=D^A54=abcd23^A22=s^A120=GBP^A
35=D^A54=234ewe^A22=s^A120=GBP^A
35=D^A54=xyzw23^A22=s^A120=GBP^A
35=D^A54=223dsf^A22=s^A120=GBP^A
I want to check how many IDs match in that log file.
There are almost 10K IDs, and the log file is around 300 MB.
The sample output I am looking for is:
35=D^A54=1xre23^A22=s^A120=GBP^A
35=D^A54=234ewe^A22=s^A120=GBP^A
35=D^A54=223dsf^A22=s^A120=GBP^A
Try something like this with the grep command:
grep -w -f ids.txt abc.log
Output:
35=D^A54=1xre23^A22=s^A120=GBP^A
35=D^A54=234ewe^A22=s^A120=GBP^A
35=D^A54=223dsf^A22=s^A120=GBP^A
If you prefer to use awk, this should do:
awk -F"[=^]" 'FNR==NR {a[$0];next} $4 in a' ids.txt abc.log
35=D^A54=1xre23^A22=s^A120=GBP^A
35=D^A54=234ewe^A22=s^A120=GBP^A
35=D^A54=223dsf^A22=s^A120=GBP^A
This stores the contents of ids.txt in array a.
If the fourth field (with fields separated by = and ^) is one of those IDs, the line is printed.
You can also do it the other way around:
awk 'FNR==NR {a[$0];next} {for (i in a) if (i ~ $0) print i}' abc.log ids.txt
35=D^A54=1xre23^A22=s^A120=GBP^A
35=D^A54=234ewe^A22=s^A120=GBP^A
35=D^A54=223dsf^A22=s^A120=GBP^A
This stores every line of abc.log in array a.
Then, for each line of ids.txt, it tests whether a stored log line contains that ID.
If so, the log line is printed.
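The field-split approach can be checked on a small extract of the sample data; the question asks for a count, so piping through `wc -l` instead of printing the lines would give that directly:

```shell
printf '1xre23\n223dsf\n234ewe\n' > ids.txt
printf '35=D^A54=1xre23^A22=s^A120=GBP^A\n35=D^A54=abcd23^A22=s^A120=GBP^A\n35=D^A54=234ewe^A22=s^A120=GBP^A\n' > abc.log

# Print the matching log lines; append `| wc -l` to get just the count.
awk -F"[=^]" 'FNR==NR{a[$0]; next} $4 in a' ids.txt abc.log
```

Because it does a hash lookup per line instead of scanning all 10K IDs, this stays fast on a 300 MB log.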

Median Calculation in Unix

I need to calculate the median value for the input file below. It works fine for an odd number of data rows but not for an even number. Below are the input file and the script used. Could you please check what is wrong with this command and correct it?
Input file:
col1,col2
AR,2.52
AR,3.57
AR,1.29
AR,6.66
AR,3.05
AR,5.52
Desired Output:
AR,3.31
Unix command:
cat test.txt | sort -t"," -k2n,2 | awk '{arr[NR]=$1} END { if (NR%2==1) print arr[(NR+1)/2]; else print (arr[NR/2]+arr[NR/2+1])/2}'
Don't forget that your input file has an additional line containing the header. You need an additional step in your awk script to skip the first line.
Also, because you're using the default field separator, $1 will contain the whole line, so your code (arr[NR/2]+arr[NR/2+1])/2 is never going to work. I would suggest changing it so that awk splits the input on a comma, then using the second field, $2.
sort -t, -k2n,2 file | awk -F, 'NR>1{a[++i]=$2}END{if(i%2==1)print a[(i+1)/2];else print (a[i/2]+a[i/2+1])/2}'
I also removed your useless use of cat. Most tools, including sort and awk, are capable of reading in files directly, so you don't need to use cat with them.
Testing it out:
$ cat file
col1,col2
AR,2.52
AR,3.57
AR,1.29
AR,6.66
AR,3.05
AR,5.52
$ sort -t, -k2n,2 file | awk -F, 'NR>1{a[++i]=$2}END{if(i%2==1)print a[(i+1)/2];else print (a[i/2]+a[i/2+1])/2}'
3.31
It shouldn't be too difficult to modify the script slightly to change the output to whatever you want.
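For instance, to reproduce the desired `AR,3.31` format, carry the group label from the first column into the END block (a small variation on the command above, assuming all rows share one label):

```shell
printf 'col1,col2\nAR,2.52\nAR,3.57\nAR,1.29\nAR,6.66\nAR,3.05\nAR,5.52\n' > test.txt

# g remembers the label from the last data row; the header sorts first
# (it evaluates as 0 numerically) and NR>1 skips it.
sort -t, -k2n,2 test.txt |
awk -F, 'NR>1{a[++i]=$2; g=$1}
         END{ m = (i%2==1) ? a[(i+1)/2] : (a[i/2]+a[i/2+1])/2; print g "," m }'
# → AR,3.31
```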

Filter file based on internal value within a column

In UNIX, I would like to filter my 3-column file based on the "DP" values within the 3rd column.
I'd like to obtain only the rows that have DP values higher than 7.
A|49.14|AC=2;AF=0.500;AN=4;BaseQRankSum=1.380;DP=6;Dels=0.00;
T|290.92|AC=2;AF=1.00;AN=2;DP=8;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;
T|294.75|AC=6;AF=1.00;AN=6;DP=9;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=6;
I'm using "|" here to separate my three columns.
Here's one simple solution
$ echo "A|49.14|AC=2;AF=0.500;AN=4;BaseQRankSum=1.380;DP=6;Dels=0.00;
T|290.92|AC=2;AF=1.00;AN=2;DP=8;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;
T|294.75|AC=6;AF=1.00;AN=6;DP=9;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=6;" \
| awk '{dpVal=$0;sub(/.*DP=/, "", dpVal);sub(/;.*$/,"", dpVal); if (dpVal>7) print}'
output
T|290.92|AC=2;AF=1.00;AN=2;DP=8;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;
T|294.75|AC=6;AF=1.00;AN=6;DP=9;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=6;
This makes a copy of each line ($0), then strips away everything before DP= and everything after the trailing ; for that field, leaving just the value of DP. That value is tested, and if the test is true the whole line is printed (the default action of awk's print is to print the whole line, but you can tell it to print anything you like, maybe print "Found it:" $0, or zillions of variants).
edit
I would like to keep all the first 53 lines intact and save them as well to my Output.txt file.
Yes, very easy; you're on the right track. With awk it is very easy to have multiple conditions process different parts of a file. Try this:
awk 'FNR <= 53 {print}
FNR > 53 {
dpVal=$0;sub(/.*DP=/, "", dpVal);sub(/;.*$/,"", dpVal)
if (dpVal>7) print
}' File.vcf > Output.txt
(I don't have a file to test with, so let me know if this isn't right.)
IHTH
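As a self-contained check of the same idea on the sample rows (with `dpVal+0` added to force a numeric comparison, which matters once DP values reach two digits):

```shell
cat > File.vcf <<'EOF'
A|49.14|AC=2;AF=0.500;AN=4;BaseQRankSum=1.380;DP=6;Dels=0.00;
T|290.92|AC=2;AF=1.00;AN=2;DP=8;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;
T|294.75|AC=6;AF=1.00;AN=6;DP=9;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=6;
EOF

awk '{ dpVal=$0; sub(/.*DP=/,"",dpVal); sub(/;.*$/,"",dpVal)  # isolate the DP value
       if (dpVal+0 > 7) print }' File.vcf
```

Only the DP=8 and DP=9 rows are printed.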
