I need to process a text file - a big CSV - to correct its format. The CSV has a field which contains XML data, formatted to be human readable: broken up into multiple lines and indented with spaces. I need every record on one line, so I am using awk to join lines, then sed to get rid of extra spaces between XML tags, and finally tr to eliminate unwanted "\r" characters.
(The first field of each record is always 8 digits and the field separator is the pipe character: "|".)
The awk script (join4.awk) is:
BEGIN {
# initialise "line" variable. Maybe unnecessary
line=""
}
{
# check if this line is a beginning of a new record
if ( $0 ~ /^[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]\|/ ) {
# if it is a new record, then print stuff already collected
# then update line variable with $0
if (line != "") print line
line = $0
} else {
# if it is not, then just attach $0 to the line
line = line $0
}
}
END {
# print out the last record kept in line variable
if (line) print line
}
and the command line is:
cat inputdata.csv | awk -f join4.awk | tr -d "\r" | sed 's/> *</></g' > corrected_data.csv
My question: is there an efficient way to implement the tr and sed functionality inside the awk script? This is not Linux, so I have no gawk, just plain old awk and nawk.
thanks,
--Trifo
tr -d "\r"
Is just gsub(/\r/, "").
sed 's/> *</></g'
That's just gsub(/> *</, "><").
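Dropped into join4.awk, the post-processing would look something like this, applied to line just before each print:
gsub(/\r/, "", line)        # replaces tr -d "\r"
gsub(/> *</, "><", line)    # replaces sed 's/> *</></g'
print line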
mawk NF=NF RS='\r?\n' FS='> *<' OFS='><'
(An alternative one-liner: assuming mawk is available, this re-splits each line on "> *<" and rejoins the fields with "><", while the regex RS strips the "\r" from each line ending.)
Thank you all folks!
You gave me the inspiration to get to a solution. It is like this:
BEGIN {
# initialize "line" variable. Maybe unnecessary.
line=""
}
{
# if the line begins with 8 numbers and a pipe char (the format of the first record)...
if ( $0 ~ /^[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]\|/ ) {
# ... then the previous record is complete: post-process it, then print it out
# workarounds for the missing gsub function
# removing extra spaces between xml tags
# removing extra \r characters the same way
while ( line ~ "\r") { sub( /\r/,"",line) }
# "<text text> <tag tag>" should look like "<text text><tag tag>"
 *<")">
while ( line ~ /> +</ ) { sub( /> +</, "><", line) }   # "+" not "*": "><" itself matches "> *<" and the loop would never end
# then print the record and update line var with the beginning of the new record
if (line != "") print line
line = $0
} else {
# just keep extending the record with the actual line
line = line $0
}
}
END {
# print the last record kept in line var
if (line) {
while ( line ~ "\r") { sub( /\r/,"",line) }
 *<")">
while ( line ~ /> +</ ) { sub( /> +</, "><", line) }
print line
}
}
And yes, it is efficient: the embedded version runs about 33% faster.
And yes, it would be nicer to create a function for post-processing the records in the "line" variable; as it is, I had to write the same code twice to handle the last record in the END section (see the sketch below). But it works, it creates the same output as the chained commands, and it is way faster.
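For reference, a sketch of that function-based variant (nawk supports user-defined functions, though truly old awk does not):
function postprocess(s) {
    # no gsub here: remove \r one occurrence at a time
    while ( s ~ /\r/ )   { sub(/\r/, "", s) }
    # one or more spaces, so the loop terminates once the tags are adjacent
    while ( s ~ /> +</ ) { sub(/> +</, "><", s) }
    return s
}
Both the main block and the END block can then use print postprocess(line).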
So, thanks for the inspiration again!
--Trifo
Related
I want to know how the command below works.
awk '/Conditional jump or move depends on uninitialised value/ {block=1} block {str=str sep $0; sep=RS} /^==.*== $/ {block=0; if (str!~/oracle/ && str!~/OCI/ && str!~/tuxedo1222/ && str!~/vprintf/ && str!~/vfprintf/ && str!~/vtrace/) { if (str!~/^$/){print str}} str=sep=""}' file_name.txt >> CondJump_val.txt
I'd also like to know how to check the texts Oracle, OCI, and so on from the second line only.
The first step is to write it so it's easier to read:
awk '
/Conditional jump or move depends on uninitialised value/ {block=1}
block {
str=str sep $0
sep=RS
}
/^==.*== $/ {
block=0
if (str!~/oracle/ && str!~/OCI/ && str!~/tuxedo1222/ && str!~/vprintf/ && str!~/vfprintf/ && str!~/vtrace/) {
if (str!~/^$/) {
print str
}
}
str=sep=""
}
' file_name.txt >> CondJump_val.txt
It accumulates lines into the variable str, starting with a line matching "Conditional jump ..." and ending with one matching "==...== ".
If the accumulated string does not match several patterns, the string is printed.
I'd also like to know how to check the texts Oracle, OCI, and so on from the second line only.
What does that mean? I assume you don't want to see the "Conditional jump..." line in the output. If that's the case, use the next command to jump to the next line of input:
/Conditional jump or move depends on uninitialised value/ {
block=1
next
}
Perhaps consolidate those regexes into a single alternation?
if (str !~ "oracle|OCI|tuxedo1222|v[f]?printf|vtrace") {
print str
}
There are two idiomatic awkisms to understand.
The first can be simplified to this:
$ seq 100 | awk '/^22$/{flag=1}
/^31$/{flag=0}
flag'
22
23
...
30
Why does this work? In awk, a variable can be tested even before it has been assigned, which is what the stand-alone flag expression is doing: the input line is printed only when flag is true. flag=1 is executed only after a line matches the regex /^22$/, and the condition ends when a line matches /^31$/ in this simple example.
This is an awk idiom for executing code between two regex matches on different lines.
In your case, the two regex's are:
/Conditional jump or move depends on uninitialised value/ # start
# in-between, block is true and collect the input into str separated by RS
/^==.*== $/ # end
The other 'awkism' is this:
block {str=str sep $0; sep=RS}
When block is true, collect $0 into str. The first time through, sep is empty, so nothing is prepended before the first line; after that sep=RS, so each subsequent line is joined with a record separator. The result is:
str="first lineRSsecond lineRSthird lineRS..."
Both idioms depend on awk being able to use an undefined variable without error: an unset variable evaluates to the empty string in string context and to 0 (false) in boolean context.
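A quick way to watch the sep idiom in isolation, using a visible "-" as the separator instead of RS:
$ printf 'a\nb\nc\n' | awk '{ s = s sep $0; sep = "-" } END { print s }'
a-b-c
On the first line sep is still unset (empty), so nothing is prepended; every later line is joined with "-".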
I am currently learning to use awk, and found an awk command that I needed but do not fully understand. This line of code takes a genome file called a fasta and prints the length of each sequence in it. For those unfamiliar with fasta files, they are text files that can contain multiple genetic sequences called contigs. They follow the general structure of:
>Nameofsequence
Sequencedata like: ATGCATCG
GCACGACTCGCTATATTATA
>Nameofsequence2
Sequencedata
The line is found here:
cat file.fa | awk '$0 ~ ">" {if (NR > 1) {print c;} c=0;printf substr($0,2,100) "\t"; } $0 !~ ">" {c+=length($0);} END { print c; }'
I understand that cat is opening the fasta file, that the script checks whether a line is a sequence-name line, and that at some point it counts the characters in the data section. But I do not understand how it breaks the data section into substrings, nor how it resets the count for each new sequence.
EDIT by Ed Morton: here's the above awk script formatted legibly by gawk -o-:
$0 ~ ">" {
if (NR > 1) {
print c
}
c = 0
printf substr($0, 2, 100) "\t"
}
$0 !~ ">" {
c += length($0)
}
END {
print c
}
First format the command:
awk '
$0 ~ ">" {
if (NR > 1) {print c;}
c=0;
printf substr($0,2,100) "\t";
}
$0 !~ ">" {
c+=length($0);
}
END { print c; }
' file.fa
The code uses c for a character count. This count starts at 0 and is reset to 0 every time a line containing > is parsed.
The length of the input line is added to c when the input line does not contain a >.
The value of c must be printed after each sequence: when a new > is found (except on the first line), or when the complete file has been parsed (the END block).
As you might already understand now:
breaking the data section into substrings is done by matching the input line against >, and
resetting the count for each new sequence is done by c=0 in the block with $0 ~ ">".
Look at Ed's comment: the printf statement is used incorrectly. I don't know how often % occurs in a fasta file, but that is not important: input data should never be used as a printf format string. Use a fixed "%s" format for input strings.
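The corrected call, for reference:
printf "%s\t", substr($0, 2, 100)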
#WalterA already answered your question by explaining what the script does, but in case you're interested, here's an improved version. It includes a couple of small bug fixes (your use of input data as a printf format string, and the printing of an empty line if the input file is empty) plus some cleanup (no redundant testing of the same condition twice, and testing for > and removing it all at once instead of separately):
BEGIN { OFS="\t" }
sub(/^>/,"") {
if (lgth) { print name, lgth }
name = $0
lgth = 0
next
}
{ lgth += length($0) }
END {
if (lgth) { print name, lgth }
}
Alternatively you could do:
BEGIN { OFS="\t" }
sub(/^>/,"") {
if (seq != "") { print name, length(seq) }
name = $0
seq = ""
next
}
{ seq = seq $0 }
END {
if (seq != "") { print name, length(seq) }
}
but appending to a variable is slow, so the first version, which calls length() on each line of the sequence as it goes, may actually be more efficient.
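For example, run against the sample fasta from the question, either version should produce (assuming the script is saved as lengths.awk):
$ awk -f lengths.awk file.fa
Nameofsequence  47
Nameofsequence2 12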
I have a text file on Unix made up of multiple long lines:
ALTER Tit como(titel('42423432;434235111;757567562;2354679;5543534;6547673;32322332;54545453'))
ALTER Mit como(Alt('432322;434434211;754324237562;2354679;5543534;6547673;32322332;54545453'))
I need to split each line into multiple lines of no more than 42 characters.
Each split should be made after the last ";" that fits, so my ideal output file would be:
ALTER Tit como(titel('42423432;434235111; -
757567562;2354679;5543534;6547673; -
32322332;54545453'))
ALTER Mit como(Alt('432322;434434211; -
754324237562;2354679;5543534;6547673; -
32322332;54545453'))
I used fold -w 42 givenfile.txt | sed 's/ $/ -/g'
It splits the lines but doesn't add the "-" at the end of each line and doesn't split after the ";".
Any help is much appreciated.
Thanks!
awk -F';' '
w{
print""
}
{
w=length($1)
printf "%s",$1
for (i=2;i<=NF;i++){
if ((w+length($i)+1)<42){
w+=length($i)+1
printf";%s",$i
} else {
w=length($i)
printf"; -\n%s",$i
}
}
}
END{
print""
}
' file
This produces the output:
ALTER Tit como(titel('42423432;434235111; -
757567562;2354679;5543534;6547673; -
32322332;54545453'))
ALTER Mit como(Alt('432322;434434211; -
754324237562;2354679;5543534;6547673; -
32322332;54545453'))
How it works
Awk implicitly loops through each line of its input and each line is divided into fields. This code uses a single variable w to keep track of the current width of the output line.
-F';'
Tell awk to break fields on semicolons.
`w{print""}
If the previous output line was not yet terminated (w>0), print a newline to finish it before starting a new one.
w=length($1); printf "%s",$1
Print the first field of the new line and set w according to its length.
Loop over the remaining fields:
for (i=2;i<=NF;i++){
if ((w+length($i)+1)<42){
w+=length($i)+1
printf";%s",$i
} else {
w=length($i)
printf"; -\n%s",$i
}
}
This loops over the second through final fields of the line. Whenever printing another field would push the output past the 42-character limit, we print "; -\n" to terminate the current output line and start a new one with that field.
END{print""}
Print a newline at the end of the file.
This might work for you (GNU sed):
sed -r 's/.{1,42}$|.{1,41};/& -\n/g;s/...$//' file
This globally replaces either the final 1 to 42 characters of a line, or 1 to 41 characters followed by a ;, with the match followed by " -" and a newline. After the last replacement the line has three characters too many (" -" and the newline), so they are deleted.
I have a homework assignment and this is the question.
Using awk create a command that will display each field of a specific file.
Show the date at the beginning of the file with a line between and a title at the head of the output.
I have read the book and can't quite figure it out. Here is what I have:
BEGIN {
{"date" | getline d }
{ printf "\t %s\n,d }
{ print "Heading\n" }
{ print "=====================\n }
}
{ code to display each field of file??? }
Some tips about awk:
The format of an awk program is
expression { action; ... }
expression { action; ... }
...
If the expression evaluates to true, then the action block is executed. Some examples of expressions:
BEGIN # true before any lines of input are read
END # true after the last line of input has been read
/pattern/ # true if the current line matches the pattern
NR < 10 # true if the current line number is less than 10
etc. The expression can be omitted if you want the action to be performed on every line.
So, your BEGIN block has too many braces:
BEGIN {
"date" | getline d
printf("\t %s\n\n",d)
print "Heading"
print "====================="
}
You could also write
BEGIN {
system("date")
print ""
print "Heading"
print "====================="
}
or execute the date command outside of awk and pass the result in as an awk variable
awk -v d="$(date)" '
BEGIN {
printf("%s\n\n%s\n%s\n",
d,
"heading",
"======")
}'
The print command implicitly adds a newline to the output, so print "foo\n"; print "bar" will print a blank line after "foo". The printf command requires you to add newlines into your format string.
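A quick illustration of the difference:
print "foo"             # newline added automatically
printf "%s\n", "foo"    # newline must be in the format string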
Can't help you more with "code to print each field". Luuk shows that print $0 will print all fields. If that doesn't meet your requirements, you'll have to be more specific.
{"date" | getline d }
Why not simply print the current date (strftime is a gawk extension):
{ print strftime("%y-%m-%d %H:%M"); }
and for:
{ code to display each field of file??? }
simply do
{ print $0; }
if you only wanted the first field you should do:
{ print $1; }
if you only want the second field:
{print $2; }
if you want only the last field:
{print $NF;}
because NF is the number of fields on a line.
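And if the assignment means displaying every field separately, a minimal sketch (assuming the default whitespace field separator):
{ for (i = 1; i <= NF; i++) print $i }
This prints each field of each input line on its own line.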
I have files with duplicate lines like these, where only the last field differs:
OST,0202000070,01-AUG-09,002735,6,0,0202000068,4520688,-1,0,0,0,0,0,55
ONE,0208076826,01-AUG-09,002332,316,3481.055935,0204330827,29150,200,0,0,0,0,0,5
ONE,0208076826,01-AUG-09,002332,316,3481.055935,0204330827,29150,200,0,0,0,0,0,55
OST,0202000068,01-AUG-09,003019,6,0,0202000071,4520690,-1,0,0,0,0,0,55
I need to remove the first occurrence of the line and leave the second one.
I've tried:
awk '!x[$0]++ {getline; print $0}' file.csv
but it's not working as intended, as it's also removing non-duplicate lines.
#!/bin/awk -f
{
    # key = everything before the last comma-separated field
    s = substr($0, 1, match($0, /,[^,]+$/) - 1)
    # print only the first line seen for each key
    if (!seen[s]) {
        print $0
        seen[s] = 1
    }
}
If your near-duplicates are always adjacent, you can just compare to the previous entry and avoid creating a potentially huge associative array.
#!/bin/awk -f
{
    # key = everything before the last comma-separated field
    s = substr($0, 1, match($0, /,[^,]*$/) - 1)
    # when the key changes, emit the last line of the previous group
    if (NR > 1 && s != prev) {
        print prev0
    }
    prev = s
    prev0 = $0
}
END {
    # the last line read is the last of the final group
    if (NR) print $0
}
Edit: Changed the script so it prints the last one in a group of near-duplicates (no tac needed).
As a general strategy (I'm not much of an AWK pro despite taking classes with Aho) you might try:
1. Concatenate all the fields except the last.
2. Use this string as a key to a hash.
3. Store the entire line as the value in the hash.
4. When you have processed all lines, loop through the hash printing out the values.
This isn't AWK specific and I can't easily provide any sample code, but this is what I would try first (see the sketch below).
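A minimal sketch of that strategy in awk (the key and array names are illustrative, and for..in traversal order is unspecified, so output order may differ from input order):
{
    key = $0
    sub(/,[^,]*$/, "", key)    # strip the last field to form the key
    val[key] = $0              # later occurrences overwrite earlier ones
}
END {
    for (k in val) print val[k]
}
Because each new occurrence overwrites the stored value, this keeps the last line of each group, as the question asks.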