AWK command help - unix

I have a homework assignment and this is the question.
Using awk create a command that will display each field of a specific file.
Show the date at the beginning of the file with a line between and a title at the head of the output.
I have read the book and can't quite figure it out, here is what I have:
BEGIN {
{"date" | getline d }
{ printf "\t %s\n", d }
{ print "Heading\n" }
{ print "=====================\n" }
}
{ code to display each field of file??? }

Some tips about awk:
The format of an awk program is
expression { action; ... }
expression { action; ... }
...
If the expression evaluates to true, then the action block is executed. Some examples of expressions:
BEGIN # true before any lines of input are read
END # true after the last line of input has been read
/pattern/ # true if the current line matches the pattern
NR < 10 # true if the current line number is less than 10
etc. The expression can be omitted if you want the action to be performed on every line.
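To see the pattern/action pairs in motion, here is a tiny sketch (the input lines are made up):

```shell
# Each expression guards its action: BEGIN fires before any input,
# NR < 3 fires on the first two lines, END fires after the last.
printf 'alpha\nbeta\ngamma\n' | awk '
BEGIN  { print "start" }
NR < 3 { print $0 }
END    { print "end" }
'
```

This prints start, alpha, beta, end: gamma is skipped because NR < 3 is false on the third line.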
So, your BEGIN block has too many braces:
BEGIN {
    "date" | getline d
    printf("\t %s\n\n",d)
    print "Heading"
    print "====================="
}
You could also write
BEGIN {
    system("date")
    print ""
    print "Heading"
    print "====================="
}
or execute the date command outside of awk and pass the result in as an awk variable
awk -v d="$(date)" '
BEGIN {
    printf("%s\n\n%s\n%s\n",
        d,
        "heading",
        "======")
}'
The print command implicitly adds a newline to the output, so print "foo\n"; print "bar" will print a blank line after "foo". The printf command requires you to add newlines into your format string.
Can't help you more with "code to print each field". Luuk shows that print $0 will print all fields. If that doesn't meet your requirements, you'll have to be more specific.

{"date" | getline d }
Why not simply print the current date:
{ print strftime("%y-%m-%d %H:%M"); }
and for:
{ code to display each field of file??? }
simply do
{ print $0; }
if you only wanted the first field you should do:
{ print $1; }
if you only want the second field:
{print $2; }
if you want only the last field:
{print $NF;}
Because NF is the number of fields on a line.
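A one-line check of the field variables (input made up):

```shell
# $1 is the first field, $NF the last, NF the number of fields.
echo 'one two three four' | awk '{ print $1, $NF, NF }'
# one four 4
```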

Related

Need of awk command explanation

I want to know how the below command is working.
awk '/Conditional jump or move depends on uninitialised value/ {block=1} block {str=str sep $0; sep=RS} /^==.*== $/ {block=0; if (str!~/oracle/ && str!~/OCI/ && str!~/tuxedo1222/ && str!~/vprintf/ && str!~/vfprintf/ && str!~/vtrace/) { if (str!~/^$/){print str}} str=sep=""}' file_name.txt >> CondJump_val.txt
I'd also like to know how to check the texts Oracle, OCI, and so on from the second line only. 
The first step is to write it so it's easier to read
awk '
/Conditional jump or move depends on uninitialised value/ {block=1}
block {
    str=str sep $0
    sep=RS
}
/^==.*== $/ {
    block=0
    if (str!~/oracle/ && str!~/OCI/ && str!~/tuxedo1222/ && str!~/vprintf/ && str!~/vfprintf/ && str!~/vtrace/) {
        if (str!~/^$/) {
            print str
        }
    }
    str=sep=""
}
' file_name.txt >> CondJump_val.txt
It accumulates the lines from the one matching "Conditional jump ..." through the one matching "==...== " into a variable str.
If the accumulated string does not match any of several patterns, the string is printed.
I'd also like to know how to check the texts Oracle, OCI, and so on from the second line only.
What does that mean? I assume you don't want to see the "Conditional jump..." line in the output. If that's the case then use the next command to jump to the next line of input.
/Conditional jump or move depends on uninitialised value/ {
block=1
next
}
Perhaps consolidate those regexes into a single alternation?
if (str !~ "oracle|OCI|tuxedo1222|v[f]?printf|vtrace") {
print str
}
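As a sketch, the consolidated pattern behaves like a filter (sample lines made up):

```shell
# Lines matching any alternative are suppressed; the rest pass through.
printf 'calls vfprintf here\nclean line\n' |
awk '$0 !~ /oracle|OCI|tuxedo1222|v[f]?printf|vtrace/'
# clean line
```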
There are two idiomatic awkisms to understand.
The first can be simplified to this:
$ seq 100 | awk '/^22$/{flag=1}
/^31$/{flag=0}
flag'
22
23
...
30
Why does this work? In awk, flag can be tested even if it has not yet been defined, which is what the standalone flag expression is doing: the input line is printed only while flag is true, and flag=1 is executed only once the regex /^22$/ has matched. flag stays true until the regex /^31$/ matches in this simple example.
This is an awk idiom for executing code between two regex matches on different lines.
In your case, the two regex's are:
/Conditional jump or move depends on uninitialised value/ # start
# in-between, block is true and collect the input into str separated by RS
/^==.*== $/ # end
The other 'awkism' is this:
block {str=str sep $0; sep=RS}
When block is true, collect $0 into str; the first time through, sep is empty, so no RS is prepended before the first line, and from then on sep=RS puts the record separator between lines. The result is:
str="first lineRSsecond lineRSthird lineRS..."
Both idioms depend on awk being able to use an undefined variable without error.
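The accumulation half of the idiom can be watched in isolation; here "-" stands in for RS so the joins are visible:

```shell
# sep is empty (undefined) on the first line, so the separator only
# ever appears between collected records, never before the first.
seq 1 5 | awk '{ str = str sep $0; sep = "-" } END { print str }'
# 1-2-3-4-5
```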

Implement tr and sed functions in awk

I need to process a text file, a big CSV, to correct the format in it. This CSV has a field which contains XML data formatted to be human readable: broken up into multiple lines and indented with spaces. I need to have every record on one line, so I am using awk to join lines, then sed to get rid of extra spaces between XML tags, and then tr to eliminate unwanted "\r" characters.
(The first field of each record is always 8 digits and the field separator is the pipe character: "|".)
The awk script is (join4.awk):
BEGIN {
    # initialise "line" variable. Maybe unnecessary
    line=""
}
{
    # check if this line is a beginning of a new record
    if ( $0 ~ "^[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]|" ) {
        # if it is a new record, then print stuff already collected
        # then update line variable with $0
        print line
        line = $0
    } else {
        # if it is not, then just attach $0 to the line
        line = line $0
    }
}
END {
    # print out the last record kept in line variable
    if (line) print line
}
and the commandline is
cat inputdata.csv | awk -f join4.awk | tr -d "\r" | sed 's/> *</></g' > corrected_data.csv
My question is whether there is an efficient way to implement the tr and sed functionality inside the awk script. This is not Linux, so I have no gawk, just plain old awk and nawk.
thanks,
--Trifo
tr -d "\r"
Is just gsub(/\r/, "").
sed 's/> *</></g'
That's just gsub(/> *</, "><")
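Put together, a sketch of both substitutions applied per record inside awk (gsub is available in nawk and any POSIX awk; the sample input is made up):

```shell
printf 'x>\r  <y\n' | awk '{
    gsub(/\r/, "")       # replaces: tr -d "\r"
    gsub(/> *</, "><")   # replaces: sed "s/> *</></g"
    print
}'
# x><y
```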
mawk NF=NF RS='\r?\n' FS='> *<' OFS='><'
Thank you all folks!
You gave me the inspiration to get to a solution. It is like this:
BEGIN {
    # initialize "line" variable. Maybe unnecessary.
    line=""
}
{
    # if the line begins with 8 numbers and a pipe char (the format of the first record)...
    if ( $0 ~ "^[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]\|" ) {
        # ... then the previous record is ready. We can post process it, then print it out.
        # workarounds for the missing gsub function:
        # removing extra \r characters
        while ( line ~ "\r") { sub( /\r/,"",line) }
        # removing extra spaces between xml tags the same way:
        # "<text text> <tag tag>" should look like "<text text><tag tag>"
        while ( line ~ "> *<") { sub( /> *</,"><",line) }
        # then print the record and update line var with the beginning of the new record
        print line
        line = $0
    } else {
        # just keep extending the record with the actual line
        line = line $0
    }
}
END {
    # print the last record kept in line var
    if (line) {
        while ( line ~ "\r") { sub( /\r/,"",line) }
        while ( line ~ "> *<") { sub( /> *</,"><",line) }
        print line
    }
}
And yes, it is efficient: the embedded version runs about 33% faster.
And yes, it would be nicer to create a function for the post-processing of the records in the "line" variable; as it is, I had to write the same code twice to process the last record in the END section. But it works, it creates the same output as the chained commands, and it is way faster.
So, thanks for the inspiration again!
--Trifo

How does this awk line that counts the number of nucleotides in a fasta file work?

I am currently learning to use awk, and found an awk command that I needed but do not fully understand. This line of code takes a genome file called a fasta and returns the length of each sequence in it. For those unfamiliar with fasta files, they are text files that can contain multiple genetic sequences called contigs. They follow the general structure of:
>Nameofsequence
Sequencedata like: ATGCATCG
GCACGACTCGCTATATTATA
>Nameofsequence2
Sequencedata
The line is found here:
cat file.fa | awk '$0 ~ ">" {if (NR > 1) {print c;} c=0;printf substr($0,2,100) "\t"; } $0 !~ ">" {c+=length($0);} END { print c; }'
I understand that cat is opening the fasta file, that the script checks whether it's the sequence-name line, and that at some point it counts the number of characters in the data section. But I do not understand how it is breaking the data section down into substrings, nor how it is resetting the counts with each new sequence.
EDIT by Ed Morton: here's the above awk script formatted legibly by gawk -o-:
$0 ~ ">" {
    if (NR > 1) {
        print c
    }
    c = 0
    printf substr($0, 2, 100) "\t"
}
$0 !~ ">" {
    c += length($0)
}
END {
    print c
}
First format the command:
awk '
$0 ~ ">" {
    if (NR > 1) {print c;}
    c=0;
    printf substr($0,2,100) "\t";
}
$0 !~ ">" {
    c+=length($0);
}
END { print c; }
' file.fa
The code uses c for a character count. This count starts with value 0, and is reset to 0 every time a line with > is parsed.
The length of the inputline is added to c when the inputline is without a >.
The value of c must be printed after a sequence, so when it finds a new > (not on the first line) or when the complete file is parsed (block with END).
As you might already understand now:
breaking down the data section in substrings is by matching the inputline with a >, and
resetting the counts with each new sequence is done by using c=0 in the block with $0 ~ ">".
Look at Ed's comment: the printf statement is used incorrectly. I don't know how often % occurs in a fasta file, but that is not important: use "%s" as the format for input strings, as in printf "%s", substr($0,2,100), so stray % characters are not interpreted as format specifiers.
@WalterA already answered your question by explaining what the script does, but in case you're interested, here's an improved version. It includes a couple of small bug fixes (using input data as a printf format string, and printing an empty line if the input file is empty) and some improvements (no redundant testing of the same condition twice, and testing for > and removing it in one step instead of separately):
BEGIN { OFS="\t" }
sub(/^>/,"") {
    if (lgth) { print name, lgth }
    name = $0
    lgth = 0
    next
}
{ lgth += length($0) }
END {
    if (lgth) { print name, lgth }
}
Alternatively you could do:
BEGIN { OFS="\t" }
sub(/^>/,"") {
    if (seq != "") { print name, length(seq) }
    name = $0
    seq = ""
    next
}
{ seq = seq $0 }
END {
    if (seq != "") { print name, length(seq) }
}
but appending to a variable is slow so calling length() for each line of the sequence may actually be more efficient.
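For illustration, running the first version above on a small made-up fasta:

```shell
printf '>seq1\nATGCATCG\nGCACGACTCGCTATATTATA\n>seq2\nATGC\n' | awk '
BEGIN { OFS="\t" }
sub(/^>/,"") {
    if (lgth) { print name, lgth }   # flush the previous sequence
    name = $0
    lgth = 0
    next
}
{ lgth += length($0) }               # accumulate sequence length
END {
    if (lgth) { print name, lgth }   # flush the last sequence
}'
# seq1    28
# seq2    4
```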

Finding Records Using awk Command using math score

Write the unix command to display all the fields of students who have scored more than 80 in math, where the math score is also the top score among all subjects; moreover, the output should be in ascending order of the students' std (standard).
INPUT:
roll,name,std,science_marks,math_marks,college
1,A,9,60,86,SM
2,B,10,85,80,DAV
3,C,10,95,92,DAV
4,D,9,75,92,DAV
OUTPUT:
1|A|9|60|86|SM
4|D|9|75|92|DAV
myCode:
awk 'BEGIN{FS=”,” ; OFS="|"} {if($4<$5 && $5>80){print $1,$2,$3,$4,$5,$6}}'
but I'm getting an unexpected token error; please help me.
Error Message on my Mac System Terminal:
awk: syntax error at source line 1
context is
BEGIN >>> {FS=, <<<
awk: illegal statement at source line 1
Could you please try the following, written and tested with the shown samples in GNU awk. This answer doesn't hard-code the field number: it finds the column whose header is math_marks and then checks the rest of the lines accordingly.
awk '
BEGIN{
    FS=","
    OFS="|"
}
FNR==1{
    for(i=1;i<=NF;i++){
        if($i=="math_marks"){ field=i }
    }
    next
}
{
    for(i=3;i<=(NF-1);i++){
        max=(max>$i?(max?max:$i):$i)
    }
    if(max==$field && $field>80){ $1=$1; print }
    max=""
}
' Input_file
Explanation: Adding detailed explanation for above.
awk '                        ##Starting awk program from here.
BEGIN{                       ##Starting BEGIN section of code here.
    FS=","                   ##Setting field separator as comma here.
    OFS="|"                  ##Setting output field separator as | here for all lines.
}
FNR==1{                      ##Checking condition if its first line then do following.
    for(i=1;i<=NF;i++){      ##Going through all fields here.
        if($i=="math_marks"){ field=i }  ##Checking if a field value is math_marks then set field to that field number here.
    }
    next                     ##next will skip all further statements from here.
}
{
    for(i=3;i<=(NF-1);i++){  ##Going through from 3rd field to 2nd-last field here.
        max=(max>$i?(max?max:$i):$i)  ##max compares its value with the current field and keeps the maximum seen so far.
    }
    if(max==$field && $field>80){ $1=$1; print }  ##After processing all fields, if the maximum equals the math field AND it is greater than 80, print the line.
    max=""                   ##Nullifying max var here.
}
' Input_file                 ##Mentioning Input_file name here.
Your code has double quotes with the wrong encoding (typographic ”…” instead of ASCII quotes):
here
| |
v v
$ busybox awk 'BEGIN{FS=”,” ; OFS="|"} {if($4<$5 && $5>80){print $1,$2,$3,$4,$5,$6}}'
awk: cmd. line:1: Unexpected token
Replace those and your code works fine.
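For instance, with ASCII quotes the one-liner runs as intended on two of the sample rows (note it still only compares math against science via $4<$5, exactly as in the question):

```shell
printf '1,A,9,60,86,SM\n2,B,10,85,80,DAV\n' |
awk 'BEGIN{FS="," ; OFS="|"} {if($4<$5 && $5>80){print $1,$2,$3,$4,$5,$6}}'
# 1|A|9|60|86|SM
```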

awk Compare 2 files, print match and difference based on Start Range And End Range:

I need to compare two files f1.txt and f2.txt and obtain matches and non-matches. For this case
I am looking to check whether the second field of f2.txt lies between the StartRange and EndRange of f1.txt; if yes, print the f2.txt line first,
then append the entire matching line of f1.txt. Where no match is found in f1.txt, state "Not Found" and print the f2.txt line.
f1.txt
Flag,StartRange,EndRange,Month
aa,1000,2000,cc,Jan-13
bb,2500,3000,cc,Feb-13
dd,5000,9000,cc,Mar-13
f2.txt
ss,1500
xx,500
gg,2800
yy,15000
Desired Output
ss,1500,aa,1000,2000,cc,Jan-13
xx,500,Not Found,Not Found,Not Found,Not Found
gg,2800,bb,2500,3000,cc,Feb-13
yy,15000,Not Found,Not Found,Not Found,Not Found
This might work for you:
gawk 'BEGIN {
    FS=","                          # Field separator
    c=1                             # counter
    while ((getline line < ARGV[1]) > 0) {
        if (line !~ "Flag,StartRange,EndRange,Month") {  # No need for header
            F[c]=line               # store line
            split(line,a,",")       # split line
            F2[c]=a[2] ; F3[c]=a[3] # store the lines' range parts
            c++
        }
    }
}
FILENAME==ARGV[2] {
    # Work on second file
    for (i in F) {                  # For every line scan the first file
        # if within a range, step out
        if ($2>=F2[i] && $2<=F3[i]) {found=i ; break}
        # else check next
        else {found=0}
    }
    # if the above found anything print the line from second file
    # with the relevant line from the first
    if (found>0) {
        print $0 "," F[found]
    }
    # otherwise the not found message
    else {
        print $0 ",Not Found,Not Found,Not Found,Not Found"
    }
}' f1.txt f2.txt
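As an alternative sketch (not part of the answer above), the same range lookup can be written with the common FNR==NR two-file idiom instead of getline in BEGIN; the array names line, lo, and hi are my own:

```shell
cat > f1.txt <<'EOF'
Flag,StartRange,EndRange,Month
aa,1000,2000,cc,Jan-13
bb,2500,3000,cc,Feb-13
dd,5000,9000,cc,Mar-13
EOF
cat > f2.txt <<'EOF'
ss,1500
xx,500
gg,2800
yy,15000
EOF
awk -F, '
FNR==NR {                    # first file: remember ranges, skip header
    if (FNR>1) { line[FNR]=$0; lo[FNR]=$2; hi[FNR]=$3 }
    next
}
{                            # second file: look for a containing range
    found = ""
    for (i in line)
        if ($2+0 >= lo[i]+0 && $2+0 <= hi[i]+0) { found = line[i]; break }
    print $0 "," (found != "" ? found : "Not Found,Not Found,Not Found,Not Found")
}' f1.txt f2.txt
```

This produces the desired output shown in the question.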
