How to select multiple columns which are not next to each other? - unix

I have a dataset from which I am trying to select the first 10 columns and the last 28 columns (from the 125th column through the final 152nd column).
awk 'BEGIN{FS="\t"} { printf $1,$2,$3,$4,$5,$6,$7,$8,$9,$10; for(i=125; i<=NF; ++i) printf $i""FS; print ""}' Bigdata.txt > Smalldata.txt
Running this code gives me the first 12 columns (with their data) and all the headers for all 152 columns from my original big data file. How do I select both columns 1-10 and 125-152 to go into a new file? I am new to Linux and any guidance would be appreciated.

Don't reinvent the wheel: if you already know the number of columns, cut is the tool for this task.
$ cut -f1-10,125-152 bigdata
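Tab is the default delimiter.
If you don't know the number of columns, awk comes to the rescue!
$ cut -f1-10,$(awk -F'\t' '{print NF-27"-"NF; exit}' file) file
awk prints the trailing range (NF-27 through NF) after reading only the first line of the file. The -F'\t' makes awk count tab-separated fields, so fields that contain spaces do not throw off NF.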
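A quick way to sanity-check the column arithmetic is to build a throwaway 152-column line and count what comes out. This is only a sketch: the file names and the generated data (c1 ... c152) are made up for illustration, and the input is assumed to be tab-separated.

```shell
# Build a one-line, 152-column, tab-separated sample (illustrative data only).
tmpdir=$(mktemp -d)
awk 'BEGIN { for (i = 1; i <= 152; i++) printf "c%d%s", i, (i < 152 ? "\t" : "\n") }' > "$tmpdir/bigdata.txt"

# Work out the range of the last 28 columns (NF-27 through NF) from the first line.
range=$(awk -F'\t' '{ print NF-27 "-" NF; exit }' "$tmpdir/bigdata.txt")

# Keep columns 1-10 plus the computed tail range.
cut -f1-10,"$range" "$tmpdir/bigdata.txt" > "$tmpdir/smalldata.txt"

# Report the range and how many columns survived.
ncols=$(awk -F'\t' '{ print NF; exit }' "$tmpdir/smalldata.txt")
echo "range=$range columns=$ncols"
```

For a 152-column file this should report the range 125-152 and 38 remaining columns (10 + 28).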

Using the KISS principle
awk 'BEGIN{FS=OFS="\t"}
{ c=""; for(i=1;i<=10;++i) { printf "%s%s", c, $i; c=OFS }
for(i=NF-27;i<=NF;++i) { printf "%s%s", c, $i }
printf ORS }' file

Could you please try the following? No sample input was provided, so I couldn't test it. You don't need to write out fields 1 to 10 by hand; a loop can handle that too.
awk 'BEGIN{FS=OFS="\t"}{for(i=1;i<=10;i++){printf("%s%s",$i,OFS)};for(i=(NF-27);i<=NF;i++){printf("%s%s",$i,i==NF?ORS:OFS)}}' Input_file > output_file
You also don't need to worry about the headers here: the same code runs on every line with no line-specific logic, so the first line needs no special handling.
EDIT: One more point. It seems you want the values from both column ranges to appear on a single output line for each input line. The code above handles that: it prints OFS (a tab) after each value and a newline only after the last field, so the selected fields of each Input_file line stay on one line.
Explanation: Adding detailed explanation here.
awk ' ##Starting awk program here.
BEGIN{ ##Starting BEGIN section here, which will be executed before Input_file is getting read.
FS=OFS="\t" ##Setting FS and OFS as TAB here.
} ##Closing BEGIN section here for this awk code.
{ ##Starting a new BLOCK which will be executed when Input_file is being read.
for(i=1;i<=10;i++){ ##Running a for loop which will run 10 times from i=1 to i=10 value.
printf("%s%s",$i,OFS) ##Printing value of specific field with OFS value.
} ##Closing for loop BLOCK here.
for(i=(NF-27);i<=NF;i++){ ##Starting a for loop which runs over the last 28 fields (NF-27 to NF, i.e. columns 125-152 here) as per OP requirements.
printf("%s%s",$i,i==NF?ORS:OFS) ##Printing field value and checking condition i==NF, if field is last field of line print new line else print OFS (a tab).
} ##Closing block for, for loop now.
}' Input_file > output_file ##Mentioning Input_file name here, whose output is going into output_file.
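If the counts are likely to change, the two loops can be wrapped in a small shell function that takes them as arguments. This is a minimal sketch, not part of the answers above: select_cols is a hypothetical helper name, the input is assumed to be tab-separated, and it assumes head + tail does not exceed the number of columns (otherwise columns would be repeated).

```shell
# select_cols HEAD TAIL FILE: print the first HEAD and last TAIL tab-separated columns.
# (select_cols is a hypothetical helper name, not part of any standard tool.)
select_cols() {
    awk -v head="$1" -v tail="$2" 'BEGIN { FS = OFS = "\t" }
    {
        sep = ""
        for (i = 1; i <= head; i++)           { printf "%s%s", sep, $i; sep = OFS }
        for (i = NF - tail + 1; i <= NF; i++) { printf "%s%s", sep, $i; sep = OFS }
        printf "%s", ORS
    }' "$3"
}

# Tiny demonstration: 6 columns, keep the first 2 and the last 2.
tmpdir=$(mktemp -d)
printf 'a\tb\tc\td\te\tf\n1\t2\t3\t4\t5\t6\n' > "$tmpdir/in.tsv"
select_cols 2 2 "$tmpdir/in.tsv"
```

For the file in the question the call would be select_cols 10 28 Bigdata.txt > Smalldata.txt.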

Related

Using awk on all columns for just part of column content

I am trying to find a solution for the following. I have a list of gene IDs in my first column and the related GO terms in all the other columns. The number of columns after each gene ID is therefore variable. Here are the first few lines:
TRINITY_DN173118_c0_g1 GO:0000139^cellular_component^Golgi membrane
TRINITY_DN49436_c2_g1 GO:0006351^biological_process^transcription, DNA-templated
TRINITY_DN47442_c0_g1 GO:0003677^molecular_function^DNA binding GO:0006302^biological_process^double-strand break repair GO:0006310^biological_process^DNA recombination
The GO terms are delimited with tabs. I want to keep the first column, with the IDs, and all the columns that contain "biological_process". But how do I do that using awk, without a specific column to search in?
I basically want to use grep for columns, so I was trying something with awk (but I am not experienced in awk at all):
awk '/biological_process/' -> I get the full line
awk '{ print "biological_process" }' -> I only get biological process
Can someone help me out? Thanks!
AWK:
awk -F"GO:" '{printf "%s",$1}{for(i=2;i<=NF;i++) if ($i~/biological_process/)printf FS"%s",$i ;print ""}' file
1) -F"GO:" - use "GO:" string as separator
2) {printf "%s",$1} - print the first column (without new line)
3) for(i=2;i<=NF;i++) - loop over all columns except the first one
4) ($i~/biological_process/) - check if string exists in col
5) printf FS"%s",$i - if string exists in column print the separator and the string
6) print "" - print new line
input file used:
TRINITY_DN173118_c0_g1 GO:0000139^cellular_component^Golgi membrane
TRINITY_DN49436_c2_g1 GO:0006351^biological_process^transcription, DNA-templated
TRINITY_DN47442_c0_g1 GO:0003677^molecular_function^DNA binding GO:0006302^biological_process^double-strand break repair GO:0006310^biological_process^DNA recombination
output
TRINITY_DN173118_c0_g1
TRINITY_DN49436_c2_g1 GO:0006351^biological_process^transcription, DNA-templated
TRINITY_DN47442_c0_g1 GO:0006302^biological_process^double-strand break repair GO:0006310^biological_process^DNA recombination
Thanks to Ed Morton for the feedback, I have edited the answer :).
another similar awk
$ awk 'BEGIN {FS=OFS="\t"}
{line=$1;
for(i=2;i<=NF;i++) if($i~/biological_process/) line=line OFS $i;
print line}' file
TRINITY_DN173118_c0_g1
TRINITY_DN49436_c2_g1 GO:0006351^biological_process^transcription, DNA-templated
TRINITY_DN47442_c0_g1 GO:0006302^biological_process^double-strand break repair GO:0006310^biological_process^DNA recombination
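The same loop works for any term if the pattern is passed in with -v instead of being hard-coded. The sketch below rebuilds the sample as a real tab-separated file so it can be run as-is; filter_cols and the file name go.tsv are hypothetical names chosen for illustration.

```shell
# Recreate the three sample lines with real tabs between the columns.
tmpdir=$(mktemp -d)
{
    printf 'TRINITY_DN173118_c0_g1\tGO:0000139^cellular_component^Golgi membrane\n'
    printf 'TRINITY_DN49436_c2_g1\tGO:0006351^biological_process^transcription, DNA-templated\n'
    printf 'TRINITY_DN47442_c0_g1\tGO:0003677^molecular_function^DNA binding\tGO:0006302^biological_process^double-strand break repair\tGO:0006310^biological_process^DNA recombination\n'
} > "$tmpdir/go.tsv"

# filter_cols PATTERN FILE: keep column 1 plus every later column matching PATTERN.
# (filter_cols is a hypothetical helper name.)
filter_cols() {
    awk -v pat="$1" 'BEGIN { FS = OFS = "\t" }
    {
        line = $1
        for (i = 2; i <= NF; i++)
            if ($i ~ pat) line = line OFS $i
        print line
    }' "$2"
}

filter_cols 'biological_process' "$tmpdir/go.tsv"
```

Calling filter_cols 'molecular_function' "$tmpdir/go.tsv" instead keeps only the molecular_function columns. The pattern is treated as a regular expression, so regex metacharacters in it would need escaping.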

Calling a function from awk with variable input location

I have a bunch of different files, all using "|" as the delimiter. All files contain a column titled CARDNO, but not necessarily in the same position in every file. I have a function called data_mask that I want to apply to CARDNO in all of the files, turning it into NEWCARDNO.
I know that if I pass in the column number of CARDNO I can do this pretty simply, say it's the 3rd column in a 5 column file with something like:
awk -v column=$COLNUMBER '{print $1, $2, FUNCTION($column), $4, $5}' FILE
However, if all of my files have hundreds of columns and it's somewhere arbitrary in each file, this is incredibly tedious. I am looking for a way to do something along the lines of this:
awk -v column=$COLNUMBER '{print #All columns before $column, FUNCTION($column), #All columns after $column}' FILE
My function takes a string as input and changes it into a new one. It takes the value of the column as input, not the column number. Please suggest a Unix command that can pass the column value to the function and give the desired output.
Thanks in advance
If I understand your problem correctly, the first row of the file is the header and one of those columns is named CARDNO. If this is the case then you just search for the header in that file and process accordingly.
awk 'BEGIN{FS=OFS="|";c=1}
(NR==1){while($c != "CARDNO" && c<=NF) c++
if(c>NF) exit
$c="NEWCARDNO" }
(NR!=1){$c=FUNCTION($c)}
{print}' <file>
As per the comment, if there is no header in the file but you know the column number for each file, then you can simply do:
awk -v c="$column" 'BEGIN{FS=OFS="|"}{$c=FUNCTION($c)}1' <file>
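To make the header-lookup approach concrete, here is a runnable sketch. It assumes the masking logic can be written as an awk function. The data_mask body below is only a stand-in (it replaces all but the last four characters with X), since the real function was not shown, and the sample file and its contents are invented for the demonstration.

```shell
# Invented sample: CARDNO happens to be the second of three columns.
tmpdir=$(mktemp -d)
printf 'NAME|CARDNO|CITY\nalice|1234567890123456|Pune\nbob|9876543210987654|Delhi\n' > "$tmpdir/cards.txt"

awk 'BEGIN { FS = OFS = "|" }
# Stand-in for the real data_mask: replace all but the last four characters with X.
function data_mask(s,    n, out, i) {
    n = length(s)
    out = ""
    for (i = 1; i <= n - 4; i++) out = out "X"
    return out substr(s, (n > 4 ? n - 3 : 1))
}
NR == 1 {
    # Locate the CARDNO column by name, then rename it in the header.
    for (c = 1; c <= NF; c++) if ($c == "CARDNO") break
    if (c > NF) { print "CARDNO column not found" > "/dev/stderr"; exit 1 }
    $c = "NEWCARDNO"
    print
    next
}
# Every other line: rewrite only column c; all other columns pass through untouched.
{ $c = data_mask($c); print }' "$tmpdir/cards.txt" > "$tmpdir/masked.txt"

cat "$tmpdir/masked.txt"
```

Because only $c is reassigned, the script does not need to know how many columns come before or after CARDNO.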

Filter file based on internal value within a column

In UNIX, I would like to filter my 3-column file based on the "DP" values within the 3rd column.
I'd like to keep only the rows that have DP values higher than 7.
A|49.14|AC=2;AF=0.500;AN=4;BaseQRankSum=1.380;DP=6;Dels=0.00;
T|290.92|AC=2;AF=1.00;AN=2;DP=8;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;
T|294.75|AC=6;AF=1.00;AN=6;DP=9;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=6;
I'm using "|" here to separate my three columns.
Here's one simple solution
$ echo "A|49.14|AC=2;AF=0.500;AN=4;BaseQRankSum=1.380;DP=6;Dels=0.00;
T|290.92|AC=2;AF=1.00;AN=2;DP=8;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;
T|294.75|AC=6;AF=1.00;AN=6;DP=9;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=6;" \
| awk '{dpVal=$0;sub(/.*DP=/, "", dpVal);sub(/;.*$/,"", dpVal); if (dpVal+0>7) print}'
output
T|290.92|AC=2;AF=1.00;AN=2;DP=8;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;
T|294.75|AC=6;AF=1.00;AN=6;DP=9;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=6;
This makes a copy of each line ($0), then strips away everything before DP= and everything after the trailing ; char for that field, leaving just the value for DP. Adding 0 forces a numeric comparison; without it awk may compare the value as a string, and "10" sorts before "7". That value is tested, and if the test is true the whole line is printed (the default action of awk's print is to print the whole line, but you can tell it to print anything you like, maybe print "Found it:" $0 or zillions of variants).
edit
I would like to keep all the first 53 lines intact and save them as well to my Output.txt file.
Yes, that's easy, and you're on the right track. With awk it is very easy to have multiple conditions process different parts of a file. Try this:
awk 'FNR <= 53 {print}
FNR > 53 {
dpVal=$0;sub(/.*DP=/, "", dpVal);sub(/;.*$/,"", dpVal)
if (dpVal+0>7) print
}' File.vcf > Output.txt
(I don't have a file to test with, so let me know if this isn't right.)
IHTH
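An alternative sketch, not from the answer above: split the third column on ";" and look for the DP= entry explicitly, converting it to a number before comparing. The first three sample lines come from the question; the fourth (DP=12) is invented here to show that two-digit values are handled, and min is just an illustrative variable name for the threshold.

```shell
tmpdir=$(mktemp -d)
cat > "$tmpdir/sample.txt" <<'EOF'
A|49.14|AC=2;AF=0.500;AN=4;BaseQRankSum=1.380;DP=6;Dels=0.00;
T|290.92|AC=2;AF=1.00;AN=2;DP=8;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;
T|294.75|AC=6;AF=1.00;AN=6;DP=9;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=6;
G|100.00|AC=1;AF=0.500;AN=2;DP=12;Dels=0.00;
EOF

awk -F'|' -v min=7 '{
    # Break the third column into its KEY=VALUE entries.
    n = split($3, kv, ";")
    for (i = 1; i <= n; i++)
        if (kv[i] ~ /^DP=/) {
            dp = substr(kv[i], 4) + 0   # text after "DP=", forced to a number
            if (dp > min) print
            break
        }
}' "$tmpdir/sample.txt" > "$tmpdir/filtered.txt"

cat "$tmpdir/filtered.txt"
```

Lines without a DP= entry are silently dropped by this version. Anchoring the match at ^DP= also avoids picking up a different key that merely ends in DP.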

modify only one column of a big file and keep field separator the same in unix

I have a very big file (more than 10000 columns). I would like to change 3 entries in the second column and keep anything else the same, including the field separator.
For example:
ab123\t123\t0.1
ab234\t120\t0.5
I would like to check whether the second column has the entry 120, change it to 1201, and keep everything else the same.
I tried awk. It works fine but replaces the tab delimiters with spaces.
awk '{ if ( $2 == 120 ) { $2 = 1201 }; print}' file
How can I do this without losing the tab-delimited format?
You want to set FS (field separator) and OFS (output field separator) to tabs:
awk '$2==120{$2=1201}1' FS='\t' OFS='\t' file
OFS is the important variable here, as awk uses its value to separate the fields on output.
EDIT:
The structure of an awk program is condition{block}: if the condition evaluates to TRUE, the block is executed. So in $2==120{$2=1201} the condition is $2==120 (the second field has the value 120) and the block is {$2=1201} (assign the value 1201 to the second field). The default block in awk is {print $0}, so:
awk '$2==120{$2=1201}{print $0}'
Can be re-written as:
awk '$2==120{$2=1201}1'
Where 1 is the conditional which always evaluates to TRUE and because we don't specify a block the default {print $0} is executed.
For multiple conditions just add more structures, e.g.:
awk '$2==120{$2=1201}$3==130{$3=1301}1'
This is more of an if/if structure, as both blocks can be executed. An if/else uses the next statement to jump to the next line in the file. Because next also skips the trailing 1, the modified line has to be printed before it:
awk '$2==120{$2=1201;print;next}{$2=1202}1'
If the first block is executed, the second field takes the value 1201, the line is printed, and we move on to the next line; otherwise the second field takes the value 1202. So the second field always takes a new value, either 1201 or 1202.
An if/else if would be:
awk '$2==120{$2=1201;print;next}$3==130{$3=1301}1'
Here the second field may take a new value; if it does, the third field will not be updated even if its condition is true, because that condition never gets evaluated. The third field can only be updated if the first condition is FALSE and the second is TRUE.
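A small runnable sketch of the if / else-if chain on tab-separated data may help. The three input lines are invented for illustration. Note that next also skips the trailing 1, so the rewritten line has to be printed explicitly before it.

```shell
# Invented sample: line 1 matches only the $3 rule, line 2 matches both, line 3 neither.
tmpdir=$(mktemp -d)
printf 'ab123\t123\t130\nab234\t120\t130\nab345\t999\t0.5\n' > "$tmpdir/in.tsv"

awk 'BEGIN { FS = OFS = "\t" }
$2 == 120 { $2 = 1201; print; next }   # "if": rewrite column 2, print, skip the remaining rules
$3 == 130 { $3 = 1301 }                # "else if": only reached when the first rule did not match
1                                      # print every line that gets this far
' "$tmpdir/in.tsv" > "$tmpdir/out.tsv"

cat "$tmpdir/out.tsv"
```

On the second line only column 2 changes (120 becomes 1201) and column 3 stays 130, because next prevents the second rule from running. Tabs are preserved throughout since both FS and OFS are set.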
Alternatively, with GNU sed (assuming tab-separated input):
sed -E 's/^([^\t]+\t)120(\t|$)/\11201\2/' file

What effect does a trailing number have on the body of an awk script?

I have a simple awk one liner that folds the next line onto the current line when a given pattern is matched. Here it is:
awk '/two/ { printf $1; next; } 1' test.txt
with the following input:
one
two
three
four
five
six
one
two
three
four
you get:
one
twothree
four
five
six
one
twothree
four
Note that the "three" is folded up with the "two" which is what I wanted to occur. I found this solution (don't remember where) and thought it odd that the trailing '1' appeared to be the missing piece that resulted in the desired effect. If you take it out, you get:
awk '/two/ { printf $1; next; }' test.txt
twotwo
As the result. So my question is, what does the trailing '1' do, and where is it documented?
Thanks!
For awk, 1 means true, and when a pattern is true and has no action, awk prints the current line by default.
Examples:
awk '1' /etc/passwd will print the whole file
awk '0' /etc/passwd will print nothing
If you're good at arithmetic (or you have imagination or programming skills), there are many tricks you can do with this behaviour.
"The '1' at the end of whole Awk one-liner prints out the modified line (it's syntactic sugar for just "print" (that itself is syntactic sugar for "print $0"))".
-- catonmat
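The equivalence is easy to check for yourself. The sketch below uses a throwaway three-line file (invented for the demonstration) and compares the different spellings. The last command shows one of the arithmetic tricks: any non-zero pattern value counts as true.

```shell
tmpdir=$(mktemp -d)
printf 'one\ntwo\nthree\n' > "$tmpdir/test.txt"

a=$(awk '1' "$tmpdir/test.txt")                # pattern only: default action prints the line
b=$(awk '{ print }' "$tmpdir/test.txt")        # action only: runs for every line
c=$(awk '1 { print $0 }' "$tmpdir/test.txt")   # both spelled out
d=$(awk '0' "$tmpdir/test.txt")                # false pattern: nothing is printed
e=$(awk 'NR % 2' "$tmpdir/test.txt")           # any non-zero value is true: odd-numbered lines

[ "$a" = "$b" ] && [ "$b" = "$c" ] && echo "1, { print } and 1 { print \$0 } agree"
[ -z "$d" ] && echo "0 prints nothing"
printf '%s\n' "$e"
```

The final printf shows lines one and three, since NR % 2 is 1 (true) for odd line numbers and 0 (false) for even ones.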
You got your answer to your specific question BUT your posted script is unsafe. The first argument for printf is a format string, not data. You can sometimes use it to print data but only when it's hard-coded in your script, not when you're reading it from input. Look what happens if your input happens to contain a printf formatting string:
$ cat file
one
two%s
three
four
$ awk '/two/ { printf $1; next; } 1' file
one
awk: cmd. line:1: (FILENAME=file2 FNR=2) fatal: not enough arguments to satisfy format string
`two%s'
^ ran out for this one
The correct way to write the script is:
$ awk '/two/ { printf "%s",$1; next; } 1' file
one
two%sthree
four
