How to parse selected expressions until an end keyword in pyparsing

I am parsing a Verilog file to extract the dependencies within it. I don't need to parse most of the module contents; I am only interested in include statements and module instantiations. As a first step, I am attempting to extract only the include statements. This is my code so far:
include_pragma = Group(Keyword("`include") + quotedString + lineEnd.suppress())
module_definition = ZeroOrMore(~Keyword("endmodule") + MatchFirst([include_pragma, restOfLine])) + Keyword("endmodule")
The idea is that I will either match an include pragma, or a whole line of anything, until I reach "endmodule". If I try this grammar on the following input, pyparsing gets into some kind of infinite loop:
`include "InternalInclude.v"
localparam COMMA_WIDTH = 10;
localparam UNKNOWN = 1'b0,
KNOWN = 1'b1;
reg [DATA_WIDTH-1:0] TtiuPosQ2;
reg TtiuDetQ2;
reg [ 7:0] TtiuDetQ2Save;
assign DebugQO = { DataQI,         // 65:34
                   TxTrainingEnQI, // 33
                   TtiuDetQ2,      // 32
                   TtiuPosQ2       // 31:0
                 };
always @(posedge Clk or posedge ResetI) begin
    if (ResetI) begin
        CommaPosQO    <= {(DATA_WIDTH){1'b0}};
        CommaDetQO    <= 0;
        CommaPosQ1    <= {(DATA_WIDTH){1'b0}};
        CommaPosQ2    <= {(DATA_WIDTH){1'b0}};
        CommaDetQ2    <= 0;
        StateQ        <= 0;
        CommaPosSaveQ <= {(DATA_WIDTH){1'b0}};
        TtiuPosQ1     <= 0;
        TtiuPosQ2     <= 0;
        TtiuDetQ2     <= 0;
        TtiuDetQ2Save <= 8'h00;
    end
    else begin
        CommaPosQO    <= CommaPosC2;
        CommaDetQO    <= CommaDetC2;
        CommaPosQ1    <= CommaPosC;
        CommaPosQ2    <= CommaPosQ1;
        CommaDetQ2    <= (| CommaPosQ1);
        StateQ        <= StateC;
        CommaPosSaveQ <= CommaPosSaveC;
        TtiuPosQ1     <= TtiuPosC1;
        TtiuPosQ2     <= TtiuPosC2;
        TtiuDetQ2     <= TtiuDetC2;
        TtiuDetQ2Save <= TtiuDetQ2 ? 8'hFF : {1'b0, TtiuDetQ2Save[7:1]};
    end
end
endmodule
I am probably misunderstanding the ~ operator. Any suggestions?
Update
After using the setDebug() method as suggested, I found that the infinite loop prints this:
Match {Group:({"`include" quotedString using single or double quotes Suppress:(lineEnd)}) | Re:('.*')} at loc 0(1,1)
Matched {Group:({"`include" quotedString using single or double quotes Suppress:(lineEnd)}) | Re:('.*')} -> [['`include', '"InternalInclude.v"']]
Match {Group:({"`include" quotedString using single or double quotes Suppress:(lineEnd)}) | Re:('.*')} at loc 31(4,1)
Matched {Group:({"`include" quotedString using single or double quotes Suppress:(lineEnd)}) | Re:('.*')} -> ['']
Match {Group:({"`include" quotedString using single or double quotes Suppress:(lineEnd)}) | Re:('.*')} at loc 31(4,1)
Matched {Group:({"`include" quotedString using single or double quotes Suppress:(lineEnd)}) | Re:('.*')} -> ['']
Match {Group:({"`include" quotedString using single or double quotes Suppress:(lineEnd)}) | Re:('.*')} at loc 31(4,1)
Matched {Group:({"`include" quotedString using single or double quotes Suppress:(lineEnd)}) | Re:('.*')} -> ['']
Is something I am doing causing the parse position not to move forward?

The problem was with the restOfLine expression: it can match anything, including the empty string ''. So the parser would proceed as follows:
Matched `include "InternalInclude.v"
Matched '' to restOfLine
Matched the same '' to restOfLine again, because restOfLine did not consume the \n and there is still an '' left before the \n, as there always will be ;)
To fix it, I changed restOfLine to (restOfLine + lineEnd). This forces the parser to consume the \n after matching each line. The new parser reads:
include_pragma = Group(Keyword("`include") + quotedString + lineEnd.suppress())
module_definition = ZeroOrMore(~Keyword("endmodule") + MatchFirst([include_pragma, (restOfLine + lineEnd)])) + Keyword("endmodule")
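For reference, here is a minimal sketch of how the fixed grammar can be exercised (assuming pyparsing 2.x exports; verilog_source is a stand-in for the sample above):

from pyparsing import (Group, Keyword, MatchFirst, ZeroOrMore,
                       quotedString, restOfLine, lineEnd)

include_pragma = Group(Keyword("`include") + quotedString + lineEnd.suppress())
module_definition = (ZeroOrMore(~Keyword("endmodule")
                                + MatchFirst([include_pragma, (restOfLine + lineEnd)]))
                     + Keyword("endmodule"))

verilog_source = '''`include "InternalInclude.v"
localparam COMMA_WIDTH = 10;
endmodule
'''

result = module_definition.parseString(verilog_source)
# Grouped include pragmas come back as nested results; plain lines are strings.
includes = [tok.asList() for tok in result if not isinstance(tok, str)]
print(includes)  # expected: [['`include', '"InternalInclude.v"']]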

Related

Find a pattern and replace the pattern by adding a newline in front of it

I would like to replace a pattern with a newline while preserving the pattern.
Sample input
:16R:ABC:20C::CORP:30E::ABC
I would like to replace each occurrence of the pattern ":[0-9][0-9]" with a newline followed by the matched ":[0-9][0-9]" text.
Output
:16R:ABC
:20C::CORP
:30E::ABC
Currently I have come up with:
echo ":16R:ABC:20C::CORP:30E::ABC" | sed 's/[:][0-9][0-9]/\
:/g;/^$/!P;D'
:R:ABC
:C::CORP
:E::ABC
Expected Output:
:16R:ABC
:20C::CORP
:30E::ABC
It's not preserving the pattern. Any suggestions?
A straightforward sed solution (GNU sed, for \n in the replacement; the leading [A-Z] must be captured as well so it is preserved):
sed 's/\([A-Z]\)\(:[0-9][0-9][A-Z]\)/\1\n\2/g'
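Applied to the sample input, this should give:
$ echo ':16R:ABC:20C::CORP:30E::ABC' | sed 's/\([A-Z]\)\(:[0-9][0-9][A-Z]\)/\1\n\2/g'
:16R:ABC
:20C::CORP
:30E::ABC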
If you can use awk and have the GNU variant available, you can call patsplit() to split on the pattern :[0-9][0-9][A-Z] and start replacing from the 2nd occurrence onwards
awk '{
    n = patsplit($0, arr, /[:][0-9][0-9][A-Z]/)
    for (iter = 2; iter <= n; iter++)
        sub(arr[iter], ORS arr[iter])
}1'
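For example (gawk is required for patsplit()):
$ echo ':16R:ABC:20C::CORP:30E::ABC' | gawk '{ n = patsplit($0, arr, /[:][0-9][0-9][A-Z]/); for (iter = 2; iter <= n; iter++) sub(arr[iter], ORS arr[iter]) }1'
:16R:ABC
:20C::CORP
:30E::ABC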
or with any POSIX awk:
awk '{
    n = split($0, arr, /[:]/)
    for (iter = 3; iter <= n; iter++)
        if (match(arr[iter], /[0-9][0-9][a-zA-Z]/))
            sub(":" arr[iter], ORS ":" arr[iter]);
}1'

calculate the percentage of not null recs in a file in unix

How do I figure out the percentage of non-null records in my file in UNIX?
My file looks like the one below. I want to know the number of records and the percentage of non-null records. I have tried a whole lot of grep and cut commands but nothing seems to work. Can anyone help me here, please?
"name","country","age","place"
"sam","US","30","CA"
"","","",""
"joe","UK","34","BRIS"
,,,,
"jake","US","66","Ohio"
Perl solution:
#!/usr/bin/perl
use warnings;
use strict;
use 5.012;  # say, keys @arr
use Text::CSV_XS qw{ csv };

my ($count_all, @count_nonempty);
csv(in      => shift,
    out     => \'skip',
    headers => 'skip',
    on_in   => sub {
        my (undef, $columns) = @_;
        ++$count_all;
        length $columns->[$_] and $count_nonempty[$_]++
            for 0 .. $#$columns;
    },
);

for my $column (keys @count_nonempty) {
    say "Column ", 1 + $column, ": ",
        100 * $count_nonempty[$column] / $count_all, '%';
}
It uses Text::CSV_XS to read the CSV file. It skips the header line and, for each subsequent line, calls the callback specified in on_in, which increments the count of all lines, and also the per-column count of non-empty fields whenever a field's length is non-zero. For the sample above, each of the four named columns should come out at 60%, since three of the five data rows are populated.
Along with choroba, I would normally recommend using a CSV parser on CSV data.
But in this case, all we need to check is whether a record contains any character that is not a comma or a quote: if a record contains only commas and/or quotes, it is a "null" record.
awk '
/[^",]/ {nonnull++}
END {printf "%d / %d = %.2f\n", nonnull, NR, nonnull/NR}
' file
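Run against the sample file, this should print 4 / 6 = 0.67 (note that NR includes the header line, so the header counts as one of the non-null records).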
To handle leading/trailing whitespace:
awk '
{sub(/^[[:blank:]]+/,""); sub(/[[:blank:]]+$/,"")}
/[^",]/ {nonnull++}
END {printf "%d / %d = %.2f\n", nonnull, NR, nonnull/NR}
' file
If fields containing only whitespace, such as
" ","",,," "
should also count as a null record, we can simply ignore all whitespace:
awk '
/[^",[:blank:]]/ {nonnull++}
END {printf "%d / %d = %.2f\n", nonnull, NR, nonnull/NR}
' file

How to replace the second occurrence of a pattern in a Unix file

I want to replace the second occurrence of a pattern in Unix.
Input file:
12345|45345|TaskID|dksj|kdjfdsjf|TaskID|12
1245|425345|TaskID|dksj|kdjfdsjf|TaskID|12
1234|25345|TaskID|dksj|TaskID|kdjfdsjf|12|TaskID
123425|65345|TaskID|dksj|kdjfdsjf|12|TaskID
123425|15325|TaskID|dksj|kdjfdsjf|12
Sample output file:
12345|45345|TaskID|dksj|kdjfdsjf|TaskID1|12
1245|425345|TaskID2|dksj|kdjfdsjf|TaskID3|12
1234|25345|TaskID|dksj|TaskID1|kdjfdsjf|12|TaskID2
123425|65345|TaskID3|dksj|kdjfdsjf|12|TaskID4
123425|15325|TaskID|dksj|kdjfdsjf|12
Your example does not match your question, so I'll only show how to replace every second match of the given pattern.
Use awk; it's a very powerful tool for command-line text processing.
replace.sh is as follows:
#!/bin/sh
# replace.sh: replace every second occurrence of $1 with $2,
# counting occurrences across the whole input
awk -v search="$1" -v repl="$2" '
    BEGIN {
        flag = 0
    }
    {
        split($0, a, search)
        len = length(a)
        for (f = 1; f < len; f += 1) {
            printf "%s%s", a[f], (flag % 2 == 0 ? search : repl)
            flag += 1
        }
        printf "%s%s", a[len], ORS
    }
'
cat input.txt | ./replace.sh TaskID TaskID1
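For the sample input above, this should print (the flag counter is global, so occurrences are counted across the whole file, not per line):
12345|45345|TaskID|dksj|kdjfdsjf|TaskID1|12
1245|425345|TaskID|dksj|kdjfdsjf|TaskID1|12
1234|25345|TaskID|dksj|TaskID1|kdjfdsjf|12|TaskID
123425|65345|TaskID1|dksj|kdjfdsjf|12|TaskID
123425|15325|TaskID1|dksj|kdjfdsjf|12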

awk count and sum based on slab

I would like to extract all the lines from the first file (gzipped, i.e. Input.csv.gz) whose 4th field falls within a range defined by the second file (Slab.csv), where the first field is the start of the range and the second field is the end. Then, per slab, I want the count of matching rows and the sums of the 4th and 5th fields of the first file.
Input.csv.gz (gzipped):
Desc,Date,Zone,Duration,Calls
AB,01-06-2014,XYZ,450,3
AB,01-06-2014,XYZ,642,3
AB,01-06-2014,XYZ,0,0
AB,01-06-2014,XYZ,205,3
AB,01-06-2014,XYZ,98,1
AB,01-06-2014,XYZ,455,1
AB,01-06-2014,XYZ,120,1
AB,01-06-2014,XYZ,0,0
AB,01-06-2014,XYZ,193,1
AB,01-06-2014,XYZ,0,0
AB,01-06-2014,XYZ,161,2
Slab.csv
StartRange,EndRange
0,0
1,10
11,100
101,200
201,300
301,400
401,500
501,10000
Expected Output:
StartRange,EndRange,Count,Sum-4,Sum-5
0,0,3,0,0
1,10,NotFound,NotFound,NotFound
11,100,1,98,1
101,200,3,474,4
201,300,1,205,3
301,400,NotFound,NotFound,NotFound
401,500,2,905,4
501,10000,1,642,3
I am using the two commands below to get the above output, except for the "NotFound" cases.
awk -F, 'NR==FNR{s[NR]=$1;e[NR]=$2;c[NR]=$0;n++;next} {for(i=1;i<=n;i++) if($4>=s[i]&&$4<=e[i]) {print $0,","c[i];break}}' Slab.csv <(gzip -dc Input.csv.gz) >Op_step1.csv
cat Op_step1.csv | awk -F, '{key=$6","$7;++a[key];b[key]=b[key]+$4;c[key]=c[key]+$5} END{for(i in a)print i","a[i]","b[i]","c[i]}' >Op_step2.csv
Op_step2.csv
101,200,3,474,4
501,10000,1,642,3
0,0,3,0,0
401,500,2,905,4
11,100,1,98,1
201,300,1,205,3
Any suggestions to make it a one-liner command that achieves the expected output? I don't have perl or python access.
Here is another option using perl, which takes advantage of multi-dimensional arrays and hashes.
perl -F, -lane'
    BEGIN {
        $x = pop;
        ## Create an array of arrays from the start and end ranges:
        ## @range = ( [0,0], [1,10], ... )
        (undef, @range) = map { chomp; [split /,/] } <>;
        @ARGV = $x;
    }
    ## Skip the first (header) line
    next if $. == 1;
    ## Create a hash of hashes:
    ## $line{"@$_"} = { "count" => count, "sum4" => sum_of_col4, "sum5" => sum_of_col5 }
    for (@range) {
        if ($F[3] >= $_->[0] && $F[3] <= $_->[1]) {
            $line{"@$_"}{"count"}++;
            $line{"@$_"}{"sum4"} += $F[3];
            $line{"@$_"}{"sum5"} += $F[4];
        }
    }
}{
    print "StartRange,EndRange,Count,Sum-4,Sum-5";
    print join ",", @$_,
        $line{"@$_"}{"count"} // "NotFound",
        $line{"@$_"}{"sum4"}  // "NotFound",
        $line{"@$_"}{"sum5"}  // "NotFound"
            for @range
' slab input
StartRange,EndRange,Count,Sum-4,Sum-5
0,0,3,0,0
1,10,NotFound,NotFound,NotFound
11,100,1,98,1
101,200,3,474,4
201,300,1,205,3
301,400,NotFound,NotFound,NotFound
401,500,2,905,4
501,10000,1,642,3
Here is one way using awk and sort:
awk '
BEGIN {
    FS = OFS = SUBSEP = ",";
    print "StartRange,EndRange,Count,Sum-4,Sum-5"
}
FNR == 1 { next }
NR == FNR {
    ranges[$1,$2]++;
    next
}
{
    for (range in ranges) {
        split(range, tmp, SUBSEP);
        if ($4 >= tmp[1] && $4 <= tmp[2]) {
            count[range]++;
            sum4[range] += $4;
            sum5[range] += $5;
            next
        }
    }
}
END {
    for (range in ranges)
        print range,
              (count[range] ? count[range] : "NotFound"),
              (sum4[range]  ? sum4[range]  : "NotFound"),
              (sum5[range]  ? sum5[range]  : "NotFound") | "sort -t, -nk1,2"
}' slab input
StartRange,EndRange,Count,Sum-4,Sum-5
0,0,3,NotFound,NotFound
1,10,NotFound,NotFound,NotFound
11,100,1,98,1
101,200,3,474,4
201,300,1,205,3
301,400,NotFound,NotFound,NotFound
401,500,2,905,4
501,10000,1,642,3
Set the input and output field separators and SUBSEP to ,. Print the header line.
If it is the first line of either file, skip it (both files have headers).
Load the entire Slab.csv into an array called ranges.
For every range in the ranges array, split the range to get the start and end. If the 4th column falls within the range, increment the count array and add the values to the sum4 and sum5 arrays.
In the END block, iterate through the ranges and print them.
Pipe the output to sort to get the output in order.

How to check if the variable value in AWK script is null or empty?

I am using an AWK script to process some logs.
At one point I need to check whether a variable's value is null or empty in order to make a decision.
Any idea how to achieve this?
awk '
{
{
split($i, keyVal, "#")
key=keyVal[1];
val=keyVal[2];
if(val ~ /^ *$/)
val="Y";
}
}
' File
I have tried:
1) if(val == "")
2) if(val ~ /^ *$/)
Neither works.
The comparison with "" should have worked, so that's a bit odd.
As one more alternative, you could use the length() function: if it returns zero, your variable is null/empty. E.g.,
if (length(val) == 0)
Also, perhaps the built-in variable NF (number of fields) could come in handy? Since we don't have access to your input data it's hard to say, but it is another possibility.
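For instance, a minimal sketch combining both ideas (the key#value field layout is assumed from your split() call):
awk -F'#' '{
    key = $1
    val = $2
    # no second field at all, or an empty one
    if (NF < 2 || length(val) == 0)
        val = "Y"
    print key, val
}' File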
You can use the variable directly without a comparison; an empty/null/zero value is considered false, everything else is true.
See here:
# setting default tag if not provided
if (! tag) {
    tag = "default-tag"
}
So this script will have the variable tag set to default-tag unless the user calls it like this:
$ awk -v tag=custom-tag -f script.awk targetFile
This is true as of:
GNU Awk 4.1.3, API: 1.1 (GNU MPFR 3.1.4, GNU MP 6.1.0)
It works just fine for me:
$ awk 'BEGIN{if(val==""){print "null or empty"}}'
null or empty
You can't differentiate between a variable being empty and null: when you access an "unset" variable, awk just initializes it with the default value (here "" - the empty string). You can use some sort of workaround, for example setting a val_accessed variable to 0 and then to 1 when you assign val. Or, a simpler (somewhat "hackish") approach: initialize val to "uninitialized" (or some other value that can't appear when running your program).
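For instance, a minimal sketch of the sentinel workaround (the key#value layout from your script is assumed):
awk '{
    val_accessed = 0              # sentinel: was val assigned for this record?
    n = split($0, keyVal, "#")
    if (n > 1) {
        val = keyVal[2]
        val_accessed = 1
    }
    if (!val_accessed)
        print NR ": val was never set"
    else if (val == "")
        print NR ": val was set to the empty string"
    else
        print NR ": val = " val
}' File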
PS: your script looks strange to me; what are the nested brackets for?
I accidentally discovered this less-used, gawk-specific function that could help differentiate:
****** gawk-only ******
BEGIN {
    $0 = "abc"
    print NF, $0
    test_function()
    test_function($(NF + 1))
    test_function("")
    test_function($0)
}
function test_function(_) { print typeof(_) }
which prints:
1 abc
untyped
unassigned
string
string
So it seems, for non-numeric-like data:
absolutely no input to the function at all: untyped
a non-existent or empty field, including $0: unassigned
any non-numeric-appearing string, including "": string
Here's the chaotic part - numeric data:
strangely enough, for absolutely identical input, differing only in whether $0 or $1 is used in the function call, you frequently get a different value from typeof()
even a combination of leading and trailing spaces doesn't prevent gawk from identifying it as a strnum
[123]:NF:1
$0 = number:123 $1 = strnum:123 +$1 = number:123
[ 456.33]:NF:1
$0 = string: 456.33 $1 = strnum:456.33 +$1 = number:456.33000
[ 19683 ]:NF:1
$0 = string: 19683 $1 = strnum:19683 +$1 = number:19683
[-20.08554]:NF:1
$0 = number:-20.08554 $1 = strnum:-20.08554 +$1 = number:-20.08554
+/- inf/nan (same for all 4):
[-nan]:NF:1
$0 = string:-nan $1 = strnum:-nan +$1 = number:-nan
this one is a string because it was made by sprintf():
[0x10FFFF]:NF:1
$0 = string:0x10FFFF $1 = string:0x10FFFF +$1 = number:0
using the -n / --non-decimal-data flag, all stays the same except:
[0x10FFFF]:NF:1
$0 = string:0x10FFFF $1 = strnum:0x10FFFF +$1 = number:1114111
Long story short, if you want your gawk function to be able to differentiate between
empty-string input (""), versus
actually no input at all,
e.g. when the original intention is to directly apply changes to $0,
then typeof(x) == "untyped" seems to be the most reliable indicator.
It gets worse when comparing null-string padding versus a non-empty string of all zeros:
function __(_) { return (!_) ":" (!+_) }
function ___(_) { return (_ == "") }
function ____(_) { return (!_) ":" (!""_) }
$0--->[ "000" ] | __(""$0)-->{ !(""$0) : !+(""$0) }-->[ 0:1 ]
___($0)-->{ $0=="" }-->[ 0 ] | ____($0)-->{ ! $0 : (!""$0) }-->[ 1:1000 ]
$0--->[ "" ] | __(""$0)-->{ !(""$0) : !+(""$0) }-->[ 1:1 ]
___($0)-->{ $0=="" }-->[ 1 ] | ____($0)-->{ ! $0 : (!""$0) }-->[ 1:1 ]
$0--->[ " -0.0 -0" ] | __(""$0)-->{ !(""$0) : !+(""$0) }-->[ 0:1 ]
___($0)-->{ $0=="" }-->[ 0 ] | ____($0)-->{ ! $0 : (!""$0) }-->[ 0:1 -0.0 -0 ]
$0--->[ " 0x5" ] | __(""$0)-->{ !(""$0) : !+(""$0) }-->[ 0:1 ]
___($0)-->{ $0=="" }-->[ 0 ] | ____($0)-->{ ! $0 : (!""$0) }-->[ 0:1 0x5 ]
