Parsing the tab-separated header of a file in Unix

I'm trying to work out a generic script for getting the tab-separated column values (the header of the file). Separating via awk is fine for getting the column names; what I'm not getting is how to walk through each tab-separated value up to the last one (NF, if using awk). The number of columns in the file isn't fixed: sometimes it might come up with 20 columns, sometimes 100, etc.
For ex: the tab separated columns in the file are-
abc ttr nnc r32 inc ...
If I write a simple awk as:
head -1 file | awk 'BEGIN {FS="\t"} {print $1, $2}'
It'd present each tab-separated column represented by $1, $2, etc. I tried an incremental version by replacing $1, $2, etc. with $i, but it wouldn't work.
Any ideas on this?

If I understand correctly, you are asking how to loop over the fields from 1 to NF. Here is an example of such a loop:
$ head -1 file | awk -F"\t" '{for (i=1;i<=NF;i++)printf "%s ",$i; print"";}'
abc ttr nnc r32 inc
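If it helps to have one column name per line instead (say, to feed into another script), a small variant of the same loop prints each field with its index; the numbered output format here is just an illustration, not something the question asked for:
$ head -1 file | awk -F"\t" '{for (i=1;i<=NF;i++) print i": "$i}'
1: abc
2: ttr
3: nnc
4: r32
5: inc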

Related

Linux - Get Substring from 1st occurrence of character

FILE1.TXT
0020220101
or
01 20220101
Need to extract the date part from the file, where the text starts from 2.
Options tried:
t_FILE_DT1='awk -F"2" '{PRINT $NF}' FILE1.TXT'
t_FILE_DT2='cut -d'2' -f2- FILE1.TXT'
echo "$t_FILE_DT1"
echo "$t_FILE_DT2"
1st output : 0101
2nd output : 0220101
Expected Output: 20220101
I'm new to Linux scripting. Could someone help guide where I'm going wrong?
Use grep like so:
echo "0020220101\n01 20220101" | grep -P -o '\d{8}\b'
20220101
20220101
Here, GNU grep uses the following options:
-P : Use Perl regexes.
-o : Print the matches only (1 match per line), not the entire lines.
SEE ALSO:
grep manual
perlre - Perl regular expressions
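Note that -P is specific to GNU grep built with PCRE support. On systems without it, a rough equivalent in plain ERE is to anchor the eight digits to the end of the line instead of to a word boundary (an assumption that holds for the sample data, where the date is always last):
printf '0020220101\n01 20220101\n' | grep -E -o '[0-9]{8}$'
20220101
20220101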
Using any awk:
$ awk '{print substr($0,length()-7)}' file
20220101
20220101
The above was run on this input file:
$ cat file
0020220101
01 20220101
Regarding PRINT $NF in your question - PRINT != print. Get out of the habit of using all-caps unless you're writing Cobol. See correct-bash-and-shell-script-variable-capitalization for some reasons.
The 2 in your scripts is telling awk and cut to use the character 2 as the field separator, so each will carve up the input into substrings everywhere a 2 occurs.
The 's in your question are single quotes, which make strings literal. You were intending to use backticks, `cmd`, but those are deprecated in favor of $(cmd) anyway.
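Putting those fixes together, the assignment the question was aiming for would look something like this (a sketch, reusing the substr() approach above):
t_FILE_DT1=$(awk '{print substr($0,length()-7)}' FILE1.TXT)
echo "$t_FILE_DT1"
20220101
20220101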
Instead of looking for what comes "after" the 2 (and having to worry about whether there is a space involved as well), think about extracting the last 8 characters, which you know for a fact is your date.
input="/path/to/txt/file/FILE1.TXT"
while IFS= read -r line
do
# read in the last 8 characters of $line .. You KNOW this is the date ..
# No need to worry about exact matching at that point, or spaces ..
myDate=${line: -8}
echo "$myDate"
done < "$input"
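One detail worth noting: the space in ${line: -8} matters, because ${line:-8} (no space) is the unrelated "use 8 as a default value" expansion. And if the loop exists only to do the extraction, the same last-8-characters idea works in a single pass, for example with grep -o (a sketch; it assumes every line really does end with the date):
grep -o '.\{8\}$' "$input"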
About the cut and awk commands that you tried:
Using awk -F"2" '{PRINT $NF}' file will set the field separator to 2, and $NF is the last field, so printing the value of the last field gives 0101.
Using cut -d'2' -f2- file uses a delimiter of 2 as well, and then prints all fields starting from the second field, which gives 0220101.
If you want to match a 2 followed by 7 digits at the end of the string (note that the {7} regex interval requires a POSIX-compliant awk; some very old awks lack it):
awk '
match ($0, /2[0-9]{7}$/) {
print substr($0, RSTART, RLENGTH)
}
' file
Output
20220101
The accepted answer shows how to extract an eight-digit substring, but that's not literally what you asked.
grep -o '2.*' file
will extract from the first occurrence of 2, and
grep -o '2[0-9]*' file
will extract all the digits after every occurrence of 2. If you specifically want eight digits, try
grep -Eo '2[0-9]{7}' file
maybe also with a -w option if you want to only accept a match between two word boundaries. If you specifically want only digits after the first occurrence of 2, maybe try
sed -n 's/[^2]*\(2[0-9]*\).*/\1/p' file
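For reference, run against the two sample lines shown earlier, that sed command should print:
20220101
20220101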

Using awk on all columns for just part of column content

I am trying to find a solution for the following. I have a list of gene IDs in my first column and, in all the other columns, the related GO terms. The number of columns behind each gene ID is therefore variable. The first few lines are as follows:
TRINITY_DN173118_c0_g1 GO:0000139^cellular_component^Golgi membrane
TRINITY_DN49436_c2_g1 GO:0006351^biological_process^transcription, DNA-templated
TRINITY_DN47442_c0_g1 GO:0003677^molecular_function^DNA binding GO:0006302^biological_process^double-strand break repair GO:0006310^biological_process^DNA recombination
The GO terms are delimited with a tab. I want to keep the first column, with the IDs, and all the columns that contain "biological_process". But how do I do that using awk, without a specific column to search in?
I basically want to use grep for columns, so was trying something with awk (but I am not experienced in awk at all):
awk '/biological_process/' -> I get the full line
awk '{ print "biological_process" }' -> I only get biological process
Can someone help me out? Thanks!
AWK:
awk -F"GO:" '{printf "%s",$1}{for(i=2;i<=NF;i++) if ($i~/biological_process/)printf FS"%s",$i ;print ""}' file
1) -F"GO:" - use "GO:" string as separator
2) {printf "%s",$1} - print the first column (without new line)
3) for(i=2;i<=NF;i++) - run on all columns beside the first one
4) ($i~/biological_process/) - check if string exists in col
5) printf FS"%s",$i - if string exists in column print the separator and the string
6) print "" - print new line
input file used:
TRINITY_DN173118_c0_g1 GO:0000139^cellular_component^Golgi membrane
TRINITY_DN49436_c2_g1 GO:0006351^biological_process^transcription, DNA-templated
TRINITY_DN47442_c0_g1 GO:0003677^molecular_function^DNA binding GO:0006302^biological_process^double-strand break repair GO:0006310^biological_process^DNA recombination
output
TRINITY_DN173118_c0_g1
TRINITY_DN49436_c2_g1 GO:0006351^biological_process^transcription, DNA-templated
TRINITY_DN47442_c0_g1 GO:0006302^biological_process^double-strand break repair GO:0006310^biological_process^DNA recombination
Thanks to Ed Morton for the feedback, I have edited the answer :).
Another similar awk:
$ awk 'BEGIN {FS=OFS="\t"}
{line=$1;
for(i=2;i<=NF;i++) if($i~/biological_process/) line=line OFS $i;
print line}' file
TRINITY_DN173118_c0_g1
TRINITY_DN49436_c2_g1 GO:0006351^biological_process^transcription, DNA-templated
TRINITY_DN47442_c0_g1 GO:0006302^biological_process^double-strand break repair GO:0006310^biological_process^DNA recombination
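One small caveat, purely as an assumption about data not shown in the question: if the literal string biological_process could ever occur inside a free-text description rather than as the middle ^-delimited namespace, anchoring the test to the surrounding ^ characters is a little safer, e.g. in the second answer:
if ($i ~ /\^biological_process\^/) line = line OFS $i;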

unix cut to extract column from text file and save rest of the contents to a new file

I can do the following using Unix cut:
cut -f 1 myfile.out
Output:
6DKK463WXXK
VKFQ9PYP9CG
Since it's printing out the column that I want to extract. How do I create a new file without this column? In other words, I want to remove this column and keep the rest of the content.
Depending on your version of Unix, you may be able to use the --complement option to select all the fields not listed.
cut -f 2 --complement myfile.input > myfile.output
That will place all the columns from the input file into the output file, except for column 2.
You use the -d argument to specify a delimiter other than tab, which is the default.
Note from experience: Be careful with the > especially when using similar names for input and output so that you don't accidentally overwrite your input file (using tab completion, this is easy to do).
Example:
% echo one two three | cut -d ' ' -f 2 --complement
one three
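Since the question removes the first column specifically, plain POSIX cut can also do this without the GNU-only --complement flag, by asking for field 2 through the end (myfile.new is just an example output name, per the warning above about not clobbering the input):
cut -f 2- myfile.out > myfile.new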
Gdang! S.O. must be swamped right now.
This is very easy in awk
echo "1 2 3 4 5" | awk -F" " '{sub(/[^ ]+ /,""); print}'
output
2 3 4 5
This deletes everything up to and including the first space character, then the remaining record is printed.
IHTH
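Note that the sub() above assumes single-space-separated fields. For a tab-delimited file like the one in the question, the same idea would be (a sketch, just swapping the space in the regex for a tab and anchoring it to the start of the line):
awk '{sub(/^[^\t]*\t/,""); print}' myfile.out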

Filter file based on internal value within a column

In UNIX, I would like to filter my 3-column file based on the "DP" value that is within the 3rd column.
I'd like to obtain only the rows that have DP values higher than 7.
A|49.14|AC=2;AF=0.500;AN=4;BaseQRankSum=1.380;DP=6;Dels=0.00;
T|290.92|AC=2;AF=1.00;AN=2;DP=8;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;
T|294.75|AC=6;AF=1.00;AN=6;DP=9;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=6;
I'm using "|" here to separate my three columns.
Here's one simple solution
$ echo "A|49.14|AC=2;AF=0.500;AN=4;BaseQRankSum=1.380;DP=6;Dels=0.00;
AC=6;AF=1.00;AN=6;DP=9;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=6;T|290.92|AC=2;AF=1.00;AN=2;DP=8;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;
MLEAC=6;" \
| awk '{dpVal=$0;sub(/.*DP=/, "", dpVal);sub(/;.*$/,"", dpVal); if (dpVal>7) print}'
output
T|290.92|AC=2;AF=1.00;AN=2;DP=8;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;
T|294.75|AC=6;AF=1.00;AN=6;DP=9;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=6;
This makes a copy of each line ($0), then strips away everything before DP= and everything after the trailing ; for that field, leaving just the value of DP (the +0 forces a numeric comparison). That value is tested, and if the test is true the whole line is printed. (The default action of awk's print is to print the whole line, but you can tell it to print anything you like, maybe print "Found it:" $0, or zillions of variants.)
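An alternative that avoids the copy-and-strip dance is to split each line on the ; separators and test the DP field directly. This is a sketch, assuming DP= always appears as its own ;-delimited key=value pair, as it does in the sample:
awk -F';' '{for (i=1; i<=NF; i++) if ($i ~ /^DP=/ && substr($i,4)+0 > 7) {print; next}}' file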
edit
I would like to keep all the first 53 lines intact and save them as well to my Output.txt file.
Yes, very easy; you're on the right track. With awk it is very easy to have multiple conditions process different parts of a file. Try this:
awk 'FNR <= 53 {print}
FNR > 53 {
dpVal=$0;sub(/.*DP=/, "", dpVal);sub(/;.*$/,"", dpVal)
if (dpVal+0>7) print
}' File.vcf > Output.txt
(I don't have a file to test with, so let me know if this isn't right.)
IHTH

How to grep content of file and create another file with grepped content?

I want to grep the content of a file for records matching particular text, save all of the records that match that text to a new file, and also make sure that the matched content is removed from the original file.
296949657|QL|163744584|163744581|20441||
292465754|RE|W757|3012|301316469|00|
296950717|RC|7264|00001|013|27082856203|
292465754|QL|191427266|191427266|16405||
296950717|RC|7264|AETNAACTIVE|HHRPPO|27082856203|
299850356|RC|7700|153447|0891185100102-A|W19007007201|
292465754|RE|W757|3029|301316469|00|
299850356|RC|7700|153447|0891185100104-A|W19007007201|
293695591|QL|743559415|743559410|18452||
297348183|RC|6602|E924|0048|CD101699303|
297348183|RC|6602|E924|0051|CD101699303|
108327882|QL|613440276|613440275|17435||
I have written an awk command and it works as expected for small files, but for larger files it is not working as expected... I am sure that I have missed something...
awk '{print $0 > ($0~/RC/?"RC_RECORDS":"TEST.DAT")}' TEST.DAT
Any thoughts on how to fix this?
Update 1
Now, in the above file, I always want to check the value of column two against RC: if it matches, then move that record to the RC_RECORDS file, and if the value matches RE, then move it to RE_RECORDS. How can this be done?
Case 1:
So for example if i have records as
108327882|RE|613440276|613440275|RC||
then it should go to RE_RECORDS file.
Case 2:
108327882|RC|613440276|613440275|RE||
then it should go to RC_RECORDS
Case 3:
108327882|QL|613440276|613440275|RC||
then it should not go to either RE_RECORDS or RC_RECORDS
Case 4:
108327882|QL|613440276|613440275|RE||
then it should not go to either RE_RECORDS or RC_RECORDS
My hunch is that
awk '/\|RC\|/ {print > "RC_RECORDS.DAT";next} {print > "NEWTEST.DAT"}' TEST.DAT | awk '$2 == "RC"'
awk '/\|RE\|/ {print > "RE_RECORDS.DAT";next} {print > "FINAL_NEWTEST.DAT"}' NEWTEST.DAT | awk '$2 == "RE"'
but I wanted to check if there's a better and quicker solution out there that can be used.
I think this is what you want. (As an aside, the most likely reason your own one-liner misbehaves on larger files is that it redirects lines back into TEST.DAT while awk is still reading TEST.DAT, so it overwrites its own input; a small file can appear to work only because it is consumed into the read buffer in one go.)
Option 1
awk -F'|' '
$2=="RC" {print >"RC_RECORDS.TXT";next}
$2=="RE" {print >"RE_RECORDS.TXT";next}
{print >"OTHER_RECORDS.TXT"}' file
You can put it all on one line if you prefer, like this:
awk -F'|' '$2=="RC"{print >"RC_RECORDS.TXT";next} $2=="RE"{print >"RE_RECORDS.TXT";next}{print >"OTHER_RECORDS.TXT"}' file
Option 2
Or you can see how grep compares for speed/readability:
grep -E "^[[:alnum:]]+\|RC\|" file > RC_RECORDS.TXT &
grep -E "^[[:alnum:]]+\|RE\|" file > RE_RECORDS.TXT &
grep -vE "^[[:alnum:]]+\|R[CE]" file > OTHER_RECORDS.TXT &
wait
Option 3
This solution uses 2 awk processes and maybe achieves better parallelism in the I/O. The first awk extracts the RC records to a file and passes the rest onwards. The second awk extracts the RE records to a file and passes the rest on to be written to the OTHER_RECORDS.TXT file.
awk -F'|' '$2=="RC"{print >"RC_RECORDS.TXT";next} 1' file | awk -F'|' '$2=="RE"{print >"RE_RECORDS.TXT";next} 1' > OTHER_RECORDS.TXT
I created an 88M-record file (3 GB) and ran some tests on a desktop iMac as follows:
Option 1: 65 seconds
Option 2: 92 seconds
Option 3: 53 seconds
Your mileage may vary.
My file looks like this, i.e. 33% RE records, 33% RC records, and the rest junk:
00000000|RE|abcdef|ghijkl|mnopq|rstu
00000001|RC|abcdef|ghijkl|mnopq|rstu
00000002|XX|abcdef|ghijkl|mnopq|rstu
00000003|RE|abcdef|ghijkl|mnopq|rstu
00000004|RC|abcdef|ghijkl|mnopq|rstu
00000005|XX|abcdef|ghijkl|mnopq|rstu
00000006|RE|abcdef|ghijkl|mnopq|rstu
00000007|RC|abcdef|ghijkl|mnopq|rstu
00000008|XX|abcdef|ghijkl|mnopq|rstu
00000009|RE|abcdef|ghijkl|mnopq|rstu
Sanity Check
wc -l *TXT
29333333 OTHER_RECORDS.TXT
29333333 RC_RECORDS.TXT
29333334 RE_RECORDS.TXT
88000000 total
