Using awk on all columns for just part of column content - unix

I am trying to find a solution for the following: I have a list of gene IDs in the first column and the related GO terms in all the other columns. The number of columns behind each gene ID is therefore variable. Here are the first few lines:
TRINITY_DN173118_c0_g1 GO:0000139^cellular_component^Golgi membrane
TRINITY_DN49436_c2_g1 GO:0006351^biological_process^transcription, DNA-templated
TRINITY_DN47442_c0_g1 GO:0003677^molecular_function^DNA binding GO:0006302^biological_process^double-strand break repair GO:0006310^biological_process^DNA recombination
The GO terms are delimited with a tab. I want to keep the first column, with the IDs, and all columns that contain "biological_process". But how do I do that using awk, without a specific column to search in?
I basically want to grep on columns, so I was trying something with awk (but I am not experienced in awk at all):
awk '/biological_process/' -> I get the full line
awk '{ print "biological_process" }' -> I only get the literal text "biological_process"
Can someone help me out? Thanks!

AWK:
awk -F"GO:" '{printf "%s",$1}{for(i=2;i<=NF;i++) if ($i~/biological_process/)printf FS"%s",$i ;print ""}' file
1) -F"GO:" - use "GO:" string as separator
2) {printf "%s",$1} - print the first column (without new line)
3) for(i=2;i<=NF;i++) - run on all columns beside the first one
4) ($i~/biological_process/) - check if string exists in col
5) printf FS"%s",$i - if string exists in column print the separator and the string
6) print "" - print new line
input file used:
TRINITY_DN173118_c0_g1 GO:0000139^cellular_component^Golgi membrane
TRINITY_DN49436_c2_g1 GO:0006351^biological_process^transcription, DNA-templated
TRINITY_DN47442_c0_g1 GO:0003677^molecular_function^DNA binding GO:0006302^biological_process^double-strand break repair GO:0006310^biological_process^DNA recombination
output
TRINITY_DN173118_c0_g1
TRINITY_DN49436_c2_g1 GO:0006351^biological_process^transcription, DNA-templated
TRINITY_DN47442_c0_g1 GO:0006302^biological_process^double-strand break repair GO:0006310^biological_process^DNA recombination
Thanks to Ed Morton for the feedback, I have edited the answer :).

Another similar awk:
$ awk 'BEGIN {FS=OFS="\t"}
{line=$1;
for(i=2;i<=NF;i++) if($i~/biological_process/) line=line OFS $i;
print line}' file
TRINITY_DN173118_c0_g1
TRINITY_DN49436_c2_g1 GO:0006351^biological_process^transcription, DNA-templated
TRINITY_DN47442_c0_g1 GO:0006302^biological_process^double-strand break repair GO:0006310^biological_process^DNA recombination

Related

How to select multiple columns which are not next to each other?

I have a dataset from which I am trying to select the first 10 columns and the last 28 columns (from the 125th column onwards to the final, 152nd, column).
awk 'BEGIN{FS="\t"} { printf $1,$2,$3,$4,$5,$6,$7,$8,$9,$10; for(i=125; i<=NF; ++i) printf $i""FS; print ""}' Bigdata.txt > Smalldata.txt
When I try this code it gives me the first 12 columns (with their data) plus all the headers for all 152 columns from my original big data file. How do I select both columns 1-10 and 125-152 to go into a new file? I am new to Linux and any guidance would be appreciated.
Don't reinvent the wheel: if you already know the number of columns, cut is the tool for this task.
$ cut -f1-10,125-152 bigdata
Tab is the default delimiter.
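If your file used some other delimiter, say commas, cut's -d option selects it (a hypothetical comma-separated variant of the same command):
$ cut -d',' -f1-10,125-152 bigdata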
If you don't know the number of columns, awk comes to the rescue!
$ cut -f1-10,$(awk '{print NF-27"-"NF; exit}' file) file
awk computes the end of the range by reading just the first line of the file (the exit makes it stop there).
Using the KISS principle
awk 'BEGIN{FS=OFS="\t"}
{ c=""; for(i=1;i<=10;++i) { printf "%s", c $i; c=OFS}
for(i=NF-27;i<=NF;++i) { printf "%s", c $i }
printf ORS }' file
Could you please try the following? Since no samples were provided, I couldn't test it. You need NOT write fields 1...10 by hand; you can use a loop for that too.
awk 'BEGIN{FS=OFS="\t"}{for(i=1;i<=10;i++){printf("%s%s",$i,OFS)};for(i=(NF-27);i<=NF;i++){printf("%s%s",$i,i==NF?ORS:OFS)}}' Input_file > output_file
Also, you need NOT worry about headers here: since we are simply printing lines and no logic is applied to specific lines, there is no need for any special handling of the 1st line.
EDIT: One more point: you seem to mean that the column values from the different ranges should come out on a single output line for each input line. If so, my code above already handles it: it prints the separator (OFS) after each field value and prints a newline only after the last field, so all selected fields from one line of Input_file stay on one output line.
Explanation: Adding detailed explanation here.
awk ' ##Starting awk program here.
BEGIN{ ##Starting BEGIN section here, which will be executed before Input_file is getting read.
FS=OFS="\t" ##Setting FS and OFS as TAB here.
} ##Closing BEGIN section here for this awk code.
{ ##Starting a new BLOCK which will be executed when Input_file is being read.
for(i=1;i<=10;i++){ ##Running a for loop which will run 10 times, from i=1 to i=10.
printf("%s%s",$i,OFS) ##Printing the value of the current field followed by OFS.
} ##Closing the for loop BLOCK here.
for(i=(NF-27);i<=NF;i++){ ##Starting a for loop which will run over the last 28 fields (NF-27 through NF) as per OP requirements.
printf("%s%s",$i,i==NF?ORS:OFS) ##Printing the field value and checking condition i==NF: if this is the last field of the line, print a newline, else print the separator.
} ##Closing the block for the for loop now.
}' Input_file > output_file ##Mentioning Input_file name here, whose output is going into output_file.
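Not having sample data, a quick way to sanity-check the field ranges is to generate a numbered 152-column line with seq and paste (standard tools; paste -s joins lines with tabs by default) and confirm that 1 through 10 and 125 through 152 come out:
$ seq 1 152 | paste -s - | awk 'BEGIN{FS=OFS="\t"}{for(i=1;i<=10;i++){printf("%s%s",$i,OFS)};for(i=(NF-27);i<=NF;i++){printf("%s%s",$i,i==NF?ORS:OFS)}}'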

awk - Find duplicate entries in 2 columns, keep 1 duplicate and unique entries

I need to find duplicate entries across 2 different columns and keep only one of each duplicate pair, plus all unique entries. For me, if A123 is in the first column and it shows up later in the second column, it's a duplicate. I also know for sure that A123 will always be paired with B123, either as A123,B123 or B123,A123. I only need to keep one and it doesn't matter which one it is.
Ex: My input file would contain:
A123,B123
A234,B234
C123,D123
B123,A123
B234,A234
I'd like the output to be:
A123,B123
A234,B234
C123,D123
The best I can do is to extract the unique entries with:
awk -F',' 'NR==FNR{x[$1]++;next}; !x[$2]' file1 file1
or get only the duplicates with
awk -F',' 'NR==FNR{x[$1]++;next}; x[$2]' file1 file1
Any help would be greatly appreciated.
This can be shorter!
First print the line if its second field is not yet present in the array; then add the first field to the array. Only one pass over the input file is necessary:
awk -F, '!x[$2];{x[$1]++}' file1
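On the sample input above, that one pass gives exactly the desired output, since each reversed pair shows up only after its partner's first field has been recorded:
$ awk -F, '!x[$2];{x[$1]++}' file1
A123,B123
A234,B234
C123,D123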
This awk one-liner works for your example (note that for (x in a) visits the keys in an unspecified order, so the output order may differ from the input):
awk -F, '!($2 in a){a[$1]=$0}END{for(x in a)print a[x]}' file
The conventional, idiomatic awk solution:
$ awk -F, '!seen[$1>$2 ? $1 : $2]++' file
A123,B123
A234,B234
C123,D123
By convention we always use seen (rather than x or anything else) as the array name when it represents a set where you want to check whether its index has been seen before, and using a ternary expression to produce the larger of the two possible key values as the index ensures the order they appear in the input doesn't matter.
The above doesn't care about your unique situation where every $2 is paired to a specific $1 - it simply prints unique individual occurrences across a pair of fields. If you wanted it to work on the pair of fields combined (and assuming you have more fields, so just using $0 as the index wouldn't work), that'd be:
awk -F, '!seen[$1>$2 ? $1 FS $2 : $2 FS $1]++' file
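As a quick check that input order doesn't matter: both A123,B123 and B123,A123 map to the key B123,A123 (since "B123" compares greater than "A123"), so only the first occurrence prints:
$ awk -F, '!seen[$1>$2 ? $1 FS $2 : $2 FS $1]++' file
A123,B123
A234,B234
C123,D123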

Expressing tail through a variable

So I have a chunk of formatted text; I basically need to use awk to get certain columns out of it. The first thing I did was get rid of the first 10 lines (the header information, irrelevant to the info I need).
Next I got the tail by taking the total lines in the file minus 10.
Here's some code:
import=$HOME/$1
if [ -f "$import" ]
then
    # file exists
    echo "File Exists."
    totalLines=$(wc -l < "$import")
    linesMinus=$(expr "$totalLines" - 10)
    tail -n "$linesMinus" "$import"
    headless=$(tail -n "$linesMinus" "$import")
else
    # file does not exist
    echo "File does not exist."
fi
Now I need to save this tail into a variable (or maybe even a separate file) so I can access the columns.
The problem comes here:
headless=$(tail -n "$linesMinus" "$import")
When I save the output into this variable and then try to echo it back out, it's all unformatted and I can't distinguish columns to use awk on.
How can I save the tail of this file without compromising the formatting?
Just use Awk. It can do everything you need all at once and all in one program.
E.g. to skip the first 10 lines, then print the second, third, and fourth columns separated by spaces for all remaining lines from INPUT_FILE:
awk 'NR <= 10 {next;}
{print $2 " " $3 " " $4;}' INPUT_FILE
Figured it out; I kind of answered my own question when I asked it. All I did was redirect the tail command to a file in the home directory, and I can cat that file. Gotta remember to delete it at the end though!
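As an aside (not from the original thread): the usual reason the output looks unformatted is that the variable is echoed unquoted, so the shell word-splits it and collapses the newlines and tabs into single spaces. Quoting the expansion preserves the layout, no temp file needed; the awk at the end is just a placeholder for whatever column extraction you want:
headless=$(tail -n "$linesMinus" "$import")
echo "$headless" | awk '{print $2, $3}'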

parsing tab separated header of a file in unix

I'm trying to work out a generic script for getting the tab-separated column values (as a header to the file). Separating via awk is fine for getting the column names; what I'm not getting is how to walk through each tab-separated value up to the end, i.e. the last field (NF, if using awk). The number of columns in the file isn't fixed: sometimes it might be 20 columns, sometimes 100, etc.
For ex: the tab separated columns in the file are-
abc ttr nnc r32 inc ...
If I write a simple awk as:
head -1 file | awk 'BEGIN {FS="\t"} {print $1,$2, etc}'
It'd present each tab-separated column as $1, $2, etc. I tried an incremental version, replacing $1, $2, etc. with $i, but it wouldn't work.
Any ideas on this?
If I understand correctly, you are asking how to loop over the fields from 1 to NF. Here is an example of such a loop:
$ head -1 file | awk -F"\t" '{for (i=1;i<=NF;i++)printf "%s ",$i; print"";}'
abc ttr nnc r32 inc
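A small variant, if the trailing separator from printf "%s ",$i bothers you, is to build the output line first and print it once (a sketch under the same tab-separated assumption):
$ head -1 file | awk -F'\t' '{out=$1; for (i=2;i<=NF;i++) out=out " " $i; print out}'
abc ttr nnc r32 inc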

Filter file based on internal value within a column

In UNIX, I would like to filter my 3-column file based on the "DP" values within the 3rd column.
I'd like to obtain only the rows whose DP values are higher than 7.
A|49.14|AC=2;AF=0.500;AN=4;BaseQRankSum=1.380;DP=6;Dels=0.00;
T|290.92|AC=2;AF=1.00;AN=2;DP=8;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;
T|294.75|AC=6;AF=1.00;AN=6;DP=9;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=6;
I'm using "|" here to separate my three columns.
Here's one simple solution
$ echo "A|49.14|AC=2;AF=0.500;AN=4;BaseQRankSum=1.380;DP=6;Dels=0.00;
AC=6;AF=1.00;AN=6;DP=9;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=6;T|290.92|AC=2;AF=1.00;AN=2;DP=8;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;
MLEAC=6;" \
| awk '{dpVal=$0; sub(/.*DP=/, "", dpVal); sub(/;.*$/, "", dpVal); if (dpVal+0>7) print}'
output
T|290.92|AC=2;AF=1.00;AN=2;DP=8;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;
T|294.75|AC=6;AF=1.00;AN=6;DP=9;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=6;
This makes a copy of each line ($0), then strips away everything before DP= and everything after that field's trailing ; character, leaving just the value of DP. That value is tested (the +0 forces awk to compare it numerically rather than as a string), and if the test is true the whole line is printed. (The default action of awk's print is to print the whole line, but you can tell it to print anything you like, maybe print "Found it:" $0 or zillions of variants.)
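As an aside (my sketch, not part of the original answer, and assuming DP= occurs exactly once per line as in the samples), the same extraction can also be done with awk's standard match() builtin, which sets RSTART and RLENGTH for the matched substring:
$ awk 'match($0, /DP=[0-9]+/) { if (substr($0, RSTART+3, RLENGTH-3)+0 > 7) print }' file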
edit
I would like to keep all the first 53 lines intact and save them as well to my Output.txt file.
Yes, very easy; you're on the right track. With awk it is very easy to have multiple conditions process different parts of a file. Try this:
awk 'FNR <= 53 {print}
FNR > 53 {
dpVal=$0; sub(/.*DP=/, "", dpVal); sub(/;.*$/, "", dpVal)
if (dpVal+0>7) print
}' File.vcf > Output.txt
(I don't have a file to test with, so let me know if this isn't right.)
IHTH
