pyparsing parse c/cpp enums with values as user defined macros - pyparsing

I have a use case where I need to match enums whose values can be user-defined macros.
Example enum
typedef enum
{
VAL_1 = -1,
VAL_2 = 0,
VAL_3 = 0x10,
VAL_4 = TEST_ENUM_CUSTOM(1,2),
}MyENUM;
I am using the code below; it works as long as I don't use the format shown in VAL_4. I need to match the VAL_4 format as well. I am new to pyparsing, so any help is appreciated.
My code:
LBRACE, RBRACE, EQ, COMMA = map(Suppress, "{}=,")
_enum = Suppress("enum")
identifier = Word(alphas, alphanums + "_")
integer = Word("-"+alphanums) # I have tried adding "_(,)" to this, but it is not matching.
enumValue = Group(identifier("name") + Optional(EQ + integer("value")))
enumList = Group(enumValue + ZeroOrMore(COMMA + enumValue) + Optional(COMMA))
enum = _enum + Optional(identifier("enum")) + LBRACE + enumList("names") + RBRACE + Optional(identifier("typedef"))
enum.ignore(cppStyleComment)
enum.ignore(cStyleComment)
Thanks
-Purna

Just adding more characters to integer is the wrong way to go. Even this expression:
integer = Word("-"+alphanums)
isn't super-great, since it would match "---", "xyz", "q--10-", and many other non-integer strings.
Better to define integer properly. You could do:
integer = Combine(Optional('-') + Word(nums))
but I've found that for these low-level expressions that occur many places in your parse string, a Regex is best:
integer = Regex(r"-?\d+") # Regex(r"-?[0-9]+") if you like more readable re's
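To see the difference concretely, here is a small standard-library sketch (plain re, not pyparsing). It assumes that Word("-"+alphanums) accepts the same strings as the character class [-A-Za-z0-9]+; the names LOOSE, STRICT and accepted are made up for this illustration:

```python
import re

# Assumption: Word("-" + alphanums) accepts any run of these characters.
LOOSE = re.compile(r"[-A-Za-z0-9]+")
# The stricter pattern suggested above: an optional sign, then digits only.
STRICT = re.compile(r"-?\d+")


def accepted(pattern, text):
    """Return True if the whole string matches the pattern."""
    return pattern.fullmatch(text) is not None


samples = ["-1", "0", "---", "xyz", "q--10-"]
loose_hits = [s for s in samples if accepted(LOOSE, s)]
strict_hits = [s for s in samples if accepted(STRICT, s)]
```

The loose pattern accepts every sample, while the strict one keeps only the two real integers.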
Then define a similar one for hex_integer.
To add macros, we need a recursive expression, to handle the possibility of macros having arguments that are also macros.
So at this point, we should just stop writing code for a bit, and do some design. In parser development, this design usually looks like a BNF, where you describe your parser in a sort of pseudocode:
enum_expr ::= "typedef" "enum" [identifier]
"{"
enum_item_list
"}" [identifier] ";"
enum_item_list ::= enum_item ["," enum_item]... [","]
enum_item ::= identifier "=" enum_value
enum_value ::= integer | hex_integer | macro_expression
macro_expression ::= identifier "(" enum_value ["," enum_value]... ")"
Note the recursion of macro_expression: it is used in defining enum_value, but it includes enum_value as part of its own definition. In pyparsing, we use a Forward to set up this kind of recursion.
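To make the recursion in that BNF tangible, here is a hand-written recursive-descent sketch that uses only the standard library. It is not how pyparsing works internally; the function names (tokenize, parse_enum_value, parse) and the token pattern are assumptions chosen for this illustration:

```python
import re

# One token: hex integer, signed integer, identifier, or punctuation.
TOKEN = re.compile(r"\s*(0x[0-9a-fA-F]+|-?\d+|[A-Za-z_]\w*|[(),])")


def tokenize(text):
    """Split the input into tokens, raising on anything unexpected."""
    text = text.rstrip()
    pos, tokens = 0, []
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character at position {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def parse_enum_value(tokens, i=0):
    """enum_value ::= integer | hex_integer | macro_expression"""
    tok = tokens[i]
    if re.fullmatch(r"0x[0-9a-fA-F]+", tok):
        return int(tok, 16), i + 1
    if re.fullmatch(r"-?\d+", tok):
        return int(tok), i + 1
    # macro_expression ::= identifier "(" enum_value ["," enum_value]... ")"
    if not re.fullmatch(r"[A-Za-z_]\w*", tok) or tokens[i + 1:i + 2] != ["("]:
        raise ValueError(f"expected a value at token {i}")
    name, i, args = tok, i + 2, []
    while True:
        value, i = parse_enum_value(tokens, i)  # the recursive step
        args.append(value)
        sep = tokens[i] if i < len(tokens) else None
        if sep == ",":
            i += 1
        elif sep == ")":
            return [name, args], i + 1
        else:
            raise ValueError(f"expected ',' or ')' at token {i}")


def parse(text):
    """Parse a complete enum value and reject trailing tokens."""
    tokens = tokenize(text)
    value, end = parse_enum_value(tokens)
    if end != len(tokens):
        raise ValueError("trailing tokens after value")
    return value


nested = parse("TEST_ENUM_CUSTOM(1, INNER(0x10, -2))")
```

parse_enum_value calls itself for each macro argument, which is the same self-reference that a Forward expresses declaratively in pyparsing.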
See how that BNF is implemented in the code below. I build on some of the items you posted, but the macro expression required some rework. The bottom line is "don't just keep adding characters to integer trying to get something to work."
LBRACE, RBRACE, EQ, COMMA, LPAR, RPAR, SEMI = map(Suppress, "{}=,();")
_typedef = Keyword("typedef").suppress()
_enum = Keyword("enum").suppress()
identifier = Word(alphas, alphanums + "_")
# define an enumValue expression that is recursive, so that enumValues
# that are macros can take parameters that are enumValues
enumValue = Forward()
# add more types as needed - the parse actions on integer and hex_integer
# do parse-time conversion to int
integer = Regex(r"-?\d+").addParseAction(lambda t: int(t[0]))
# or just use the signed_integer expression found in pyparsing_common
# integer = pyparsing_common.signed_integer
hex_integer = Regex(r"0x[0-9a-fA-F]+").addParseAction(lambda t: int(t[0], 16))
# a macro defined using enumValue for parameters
macro_expr = Group(identifier + LPAR + Group(delimitedList(enumValue)) + RPAR)
# use '<<=' operator to attach recursive definition to enumValue
enumValue <<= hex_integer | integer | macro_expr
# remaining enum expressions
enumItem = Group(identifier("name") + Optional(EQ + enumValue("value")))
enumList = Group(delimitedList(enumItem) + Optional(COMMA))
enum = (_typedef
+ _enum
+ Optional(identifier("enum"))
+ LBRACE
+ enumList("names")
+ RBRACE
+ Optional(identifier("typedef"))
+ SEMI
)
# this comment style includes cStyleComment too, so no need to
# ignore both
enum.ignore(cppStyleComment)
Try it out:
enum.runTests([
"""
typedef enum
{
VAL_1 = -1,
VAL_2 = 0,
VAL_3 = 0x10,
VAL_4 = TEST_ENUM_CUSTOM(1,2)
}MyENUM;
""",
])
runTests is for testing and debugging your parser during development. Use enum.parseString(some_enum_expression) or enum.searchString(some_c_header_file_text) to get the actual parse results.
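As a rough sketch of what using parseString looks like, the block below rebuilds the same grammar in a self-contained form (comment handling is left out to keep it short) and reads the results back. The variable names header_text and values are assumptions for this example, and it uses the camelCase pyparsing names from the answer:

```python
from pyparsing import (Forward, Group, Keyword, Optional, Regex, Suppress,
                       Word, alphanums, alphas, delimitedList)

LBRACE, RBRACE, EQ, COMMA, LPAR, RPAR, SEMI = map(Suppress, "{}=,();")
identifier = Word(alphas, alphanums + "_")

# Recursive value expression: integers, hex integers, or macro calls.
enumValue = Forward()
integer = Regex(r"-?\d+").addParseAction(lambda t: int(t[0]))
hex_integer = Regex(r"0x[0-9a-fA-F]+").addParseAction(lambda t: int(t[0], 16))
macro_expr = Group(identifier + LPAR + Group(delimitedList(enumValue)) + RPAR)
enumValue <<= hex_integer | integer | macro_expr

enumItem = Group(identifier("name") + Optional(EQ + enumValue("value")))
enumList = Group(delimitedList(enumItem) + Optional(COMMA))
enum = (Keyword("typedef").suppress() + Keyword("enum").suppress()
        + Optional(identifier("enum")) + LBRACE + enumList("names") + RBRACE
        + Optional(identifier("typedef")) + SEMI)

# Hypothetical input, mirroring the test case above.
header_text = """
typedef enum
{
VAL_1 = -1,
VAL_2 = 0,
VAL_3 = 0x10,
VAL_4 = TEST_ENUM_CUSTOM(1,2)
}MyENUM;
"""

result = enum.parseString(header_text)
# Plain nested lists of [name, value] pairs, in declaration order.
values = result.names.asList()
typedef_name = result["typedef"]
```

Converting with asList() gives ordinary Python lists, which are easier to post-process than the ParseResults object itself.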
Using the new railroad diagram feature in the upcoming pyparsing 3.0 release, here is a visual representation of this parser:

Related

Need help understanding how gsub and tonumber are used to encode lua source code?

I'm new to Lua but figured out that gsub is a global substitution function and tonumber is a converter function. What I don't understand is how the two functions are used together to produce an encoded string.
I've already tried reading parts of PIL (Programming in Lua) and the reference manual but still, am a bit confused.
local L0_0, L1_1
function L0_0(A0_2)
return (A0_2:gsub("..", function(A0_3)
return string.char((tonumber(A0_3, 16) + 256 - 13 + 255999744) % 256)
end))
end
encodes = L0_0
L0_0 = gg
L0_0 = L0_0.toast
L1_1 = "__loading__\226\128\166"
L0_0(L1_1)
L0_0 = encodes
L1_1 = --"The Encoded String"
L0_0 = L0_0(L1_1)
L1_1 = load
L1_1 = L1_1(L0_0)
pcall(L1_1)
I removed the encoded string where I put the comment because of how long it was. If needed I can upload the encoded string as well.
gsub is being used to get two-character sections of A0_2. This means the string A0_3 is a two-digit hexadecimal number, but it is a string rather than a number, so we cannot perform math on it directly. That A0_3 is a hex number can be inferred from how tonumber is used.
tonumber from Lua 5.1 Reference Manual:
Tries to convert its argument to a number. If the argument is already a number or a string convertible to a number, then tonumber returns this number; otherwise, it returns nil.
An optional argument specifies the base to interpret the numeral. The base may be any integer between 2 and 36, inclusive. In bases above 10, the letter 'A' (in either upper or lower case) represents 10, 'B' represents 11, and so forth, with 'Z' representing 35. In base 10 (the default), the number can have a decimal part, as well as an optional exponent part (see §2.1). In other bases, only unsigned integers are accepted.
So tonumber(A0_3, 16) means we are expecting for A0_3 to be a base 16 number (hexadecimal).
Once we have the number value of A0_3 we do some math and finally convert it to a character.
function L0_0(A0_2)
return (A0_2:gsub("..", function(A0_3)
return string.char((tonumber(A0_3, 16) + 256 - 13 + 255999744) % 256)
end))
end
This block of code takes a string of hex digits and converts them into chars. tonumber is being used to allow for the manipulation of the values.
Here is an example of how this works with Hello World:
local str = "Hello World"
local hex_str = ''
for i = 1, #str do
  -- %02x keeps every byte at exactly two hex digits
  hex_str = hex_str .. string.format("%02x", str:byte(i,i))
end
function L0_0(A0_2)
return (A0_2:gsub("..", function(A0_3)
return string.char((tonumber(A0_3, 16) + 256 - 13 + 255999744) % 256)
end))
end
local encoded = L0_0(hex_str)
print(encoded)
Output
;X__bJbe_W
And taking it back to the original string:
function decode(A0_2)
return (A0_2:gsub("..", function(A0_3)
return string.char((tonumber(A0_3, 16) + 13) % 256)
end))
end
local hex_string = ''
for i = 1, #encoded do
  -- %02x keeps bytes below 0x10 at two hex digits
  hex_string = hex_string .. string.format("%02x", encoded:byte(i,i))
end
print(decode(hex_string))
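The arithmetic in the Lua function can be checked with a short Python sketch. Since 255999744 is an exact multiple of 256, the expression (n + 256 - 13 + 255999744) % 256 reduces to (n - 13) % 256. The helper name shift_hex_pairs is an assumption for this illustration, and unlike the Lua gsub it simply ignores a trailing unpaired character:

```python
def shift_hex_pairs(hex_text, shift):
    """Read two hex digits at a time and shift each byte value modulo 256."""
    return bytes(
        (int(hex_text[i:i + 2], 16) + shift) % 256
        for i in range(0, len(hex_text) - 1, 2)
    )


# The large constant contributes nothing modulo 256.
same_for_all_bytes = all(
    (n + 256 - 13 + 255999744) % 256 == (n - 13) % 256 for n in range(256)
)

hex_text = "Hello World".encode("ascii").hex()
shifted = shift_hex_pairs(hex_text, -13)         # what L0_0 computes
restored = shift_hex_pairs(shifted.hex(), 13)    # what decode computes
```

Note that the shifted space character becomes the control byte 0x13, which is why it is invisible in the printed output above.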

How do you access name of a ProtoField after declaration?

How can I access the name property of a ProtoField after I declare it?
For example, something along the lines of:
myproto = Proto("myproto", "My Proto")
myproto.fields.foo = ProtoField.int8("myproto.foo", "Foo", base.DEC)
print(myproto.fields.foo.name)
Where I get the output:
Foo
An alternate method that's a bit more terse:
local fieldString = tostring(field)
local i, j = string.find(fieldString, ": .* myproto")
print(string.sub(fieldString, i + 2, j - (1 + string.len("myproto"))))
EDIT: Or an even simpler solution that works for any protocol:
local fieldString = tostring(field)
local i, j = string.find(fieldString, ": .* ")
print(string.sub(fieldString, i + 2, j - 1))
Of course the 2nd method only works as long as there are no spaces in the field name. Since that's not necessarily always going to be the case, the 1st method is more robust. Here is the 1st method wrapped up in a function that ought to be usable by any dissector:
-- field is the field whose name you want to print.
-- protoStr is the name of the relevant protocol.
function printFieldName(field, protoStr)
  local fieldString = tostring(field)
  local i, j = string.find(fieldString, ": .* " .. protoStr)
  print(string.sub(fieldString, i + 2, j - (1 + string.len(protoStr))))
end
... and here it is in use:
printFieldName(myproto.fields.foo, "myproto")
printFieldName(someproto.fields.bar, "someproto")
Ok, this is janky, and certainly not the 'right' way to do it, but it seems to work.
I discovered this after looking at the output of
print(tostring(myproto.fields.foo))
This seems to spit out the value of each of the members of ProtoField, but I couldn't figure out the correct way to access them. So, instead, I decided to parse the string. This function will return 'Foo', but could be adapted to return the other fields as well.
function getname(field)
  -- First, convert the field into a string.
  -- This results in a long string with
  -- a bunch of info we don't need.
  local fieldString = tostring(field)
  -- fieldString looks like:
  -- ProtoField(188403): Foo myproto.foo base.DEC 0000000000000000 00000000 (null)
  -- Split the string on the first '.' character
  local a, b = fieldString:match"([^.]*).(.*)"
  -- Split the first half of the previous result (a) on the ':' character
  a, b = a:match"([^.]*):(.*)"
  -- At this point, b will equal " Foo myproto"
  -- and we want to strip out the abbreviation ("myproto") part.
  -- Count the number of times spaces occur in the string
  local spaceCount = select(2, string.gsub(b, " ", ""))
  -- Declare a counter
  local counter = 0
  -- Declare the name we are going to return
  local constructedName = ''
  -- Step through each word in (b) separated by spaces
  for word in b:gmatch("%w+") do
    -- If we have reached the last space, go ahead and return
    if counter == spaceCount-1 then
      return constructedName
    end
    -- Add the current word to our name
    constructedName = constructedName .. word .. " "
    -- Increment counter
    counter = counter+1
  end
end
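The string-slicing idea behind these functions can be sketched in Python as well. This assumes the tostring output has exactly the layout shown in the comment above ("ProtoField(id): Name abbrev.field ..."); the function name field_name and the sample strings are made up for illustration, and nothing here calls the Wireshark API:

```python
import re


def field_name(field_string, proto_abbrev):
    """Return the text between ': ' and ' <proto_abbrev>.' or None."""
    pattern = r":\s*(.*?)\s+" + re.escape(proto_abbrev) + r"\."
    match = re.search(pattern, field_string)
    return match.group(1) if match else None


sample = "ProtoField(188403): Foo myproto.foo base.DEC 0000000000000000 00000000 (null)"
spaced = "ProtoField(188404): Foo Bar myproto.bar base.DEC 0000000000000000 00000000 (null)"

name = field_name(sample, "myproto")
name_with_space = field_name(spaced, "myproto")
```

Anchoring on the protocol abbreviation, as the first method does, is what lets names containing spaces survive.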

HtmlProvider parses Fraction As DateTime

Using HtmlProvider to access a web-based table sometimes returns a fraction as a string (correct) and, at other times, returns a DateTime (incorrect).
What am I missing?
module Test =
    open System.IO
    open FSharp.Data
    let [<Literal>] url = "https://www.example.com/fractions"
    type profile = HtmlProvider<url>
    let profile = profile.Load(url)
    let [<Literal>] resultFile = @"C:\temp\data\Profile.csv"
    let CsvResult =
        do
            use writer = new StreamWriter(resultFile, false)
            writer.WriteLine "\"Date\";\"Fraction\""
            for row in profile.Tables.Table1.Rows do
                "\"" + row.``Date``.ToString() + "\"" + ";" |> writer.Write
                "\"" + row.``Fraction``.ToString() + "\"" + ";" |> writer.WriteLine
            writer.Close()
    let csvResult = CsvResult
Without seeing sample data I can't be 100% certain, but I'm guessing that it's parsing fractions as dates if the numbers involved would be valid dates in the culture you're using: e.g., 1/4 would be a valid date in any culture that uses / as a separator, and would be treated either as April 1st or as January 4th, depending on which parsing culture your system defaults to.
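The ambiguity is easy to reproduce outside F#. The sketch below uses Python's datetime rather than FSharp.Data's type inference, so it only illustrates the general problem; the helper name looks_like_date and the fixed year 2001 (appended so that every input carries a year) are assumptions for this example:

```python
from datetime import datetime


def looks_like_date(text, day_month_format):
    """Return True if text parses as a day/month pair in the given order."""
    try:
        # A fixed year is appended so the parse is unambiguous about the year.
        datetime.strptime(f"{text}/2001", f"{day_month_format}/%Y")
        return True
    except ValueError:
        return False


checks = {
    ("1/4", "%d/%m"): looks_like_date("1/4", "%d/%m"),      # April 1st
    ("1/4", "%m/%d"): looks_like_date("1/4", "%m/%d"),      # January 4th
    ("13/20", "%d/%m"): looks_like_date("13/20", "%d/%m"),  # no month 20
    ("13/20", "%m/%d"): looks_like_date("13/20", "%m/%d"),  # no month 13
}
```

A fraction such as 1/4 is a plausible date in either order, while 13/20 is not, which would explain why only some rows end up typed as dates.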
Other type providers in FSharp.Data (such as the CSV type provider) allow you to configure how each column will be parsed, but that's not an option the HTML type provider gives you (which is a bit of a missing feature, of course). But since the HTML type provider does allow you to specify the culture info for datetime and number parsing, one way you might be able to work around this is to specify a culture that does not use / as a date separator (but still uses . as a decimal point, since otherwise if the HTML you're parsing has numbers written like 1,000 for one thousand, that could be interpreted as 1). One such culture is the en-IN culture ("English (India)"), where the date separator is - and the decimal point is a period.
So try passing Culture=System.Globalization.CultureInfo.GetCultureInfo("en-IN") in your HtmlProvider options, and see if that helps it stop treating fractions as dates.
The following combination of functions worked:
// http://www.fssnip.net/29/title/Regular-expression-active-pattern
module Solution =
    open System
    open System.Text.RegularExpressions
    open FSharp.Data
    let (|Regex|_|) pattern input =
        let m = Regex.Match(input, pattern)
        if m.Success then Some(List.tail [ for g in m.Groups -> g.Value ])
        else None
    let ptrnFraction = @"^([0-9]?[0-9]?)(\/)([0-9]?[0-9]?)$"
    let ptrnDateTime = @"(\d{2})\/(\d{2})\/(\d{4}) (\d{2}):(\d{2}):(\d{2})"
    let ToFraction input =
        match input with
        | Regex ptrnFraction [ numerator; operator; denominator ] ->
            (numerator + operator + denominator).ToString()
        | Regex ptrnDateTime [ day; month; year; hours; minutes; seconds ] ->
            (day + "/" + month).ToString()
        | _ -> "Not valid!"
    let dtInput = @"05/09/2017 00:00:00"
    let frcInput = @"13/20"
    let outDate = ToFraction dtInput
    printfn "Out Date: %s" outDate
    let outFraction = ToFraction frcInput
    printfn "Out Fraction: %s" outFraction
    //Output:> Out Date: 05/09 Out Fraction: 13/20
Thus, I was able to replace:
"\"" + row.``Fraction``.ToString() + "\"" + ";" |> writer.WriteLine
with:
"\"" + ToFraction(row.``Fraction``.ToString()) + "\"" + ";" |> writer.Write
Thanks to @rmunn for the clarity of his explanations and the benefit of his expertise.

Can I do CloseMatch of Defined Parsing grammar?

I have defined a grammar:
column = Word(alphanums + '._`')
stmt = column + Literal("(") + Group(delimitedList( column )) +Literal(")")
Now I want to match the query below using a close match:
sql = "seller(food_type,count(sellers),sum(weight),Earned_money)"
I do not want to change the grammar defined above. How do I get a close match when functions are given as arguments?
result = stmt.parseString(sql)
print(result.dump())
def Review(sql):
    stmt = Getgrammar(sql)
    result = stmt.parseString(sql, parseAll=False)
    print(result.dump())
Here the variable sql is a procedure of 400-500 lines. I am automating part of a code review, and for this purpose I have written a grammar for SQL statements.
However, parsing throws an exception and terminates even if only a small string fails to match. I want it to continue even when exceptions occur, because the parsable part alone is still useful for reviewing database queries.
Getgrammar returns the grammar for a procedure, whose body consists of these SQL statements.
def Getgrammar(sql):
    InputParameters = delimitedList(Optional((_in | _out | _inout), '') + column + DataType)
    DeclarativeSyntax = (_declare + column + DataType + ';')
    # parentheses let the expression span several lines
    createProcedureStmt = (
        createProcedure
        + StoredProcedure.setResultsName("Procedure")
        + lpar
        + Optional(InputParameters.setResultsName("Input"), '')
        + rpar
        + Optional(_sql_security_invoker, '').setResultsName("SQLSECURITY")
        + _begin
        + ZeroOrMore(DeclarativeSyntax).setResultsName("Declare")
        + ZeroOrMore((selectStmt | setStmt | ifStmt.setResultsName("IfStmt")
                      | callStmt | updateStmt | createStmt | dropStmt | alterStmt | insertStmt
                      | deleteStmt | WhileStmt.setResultsName("WhileStmt") | createStmt) + ';')
        + _end + Optional(';', '')
    )
    return createProcedureStmt

pyparsing not parsing the whole string

I have the following grammar and test case:
from pyparsing import Word, nums, Forward, Suppress, OneOrMore, Group
#A grammar for a simple class of regular expressions
number = Word(nums)('number')
lparen = Suppress('(')
rparen = Suppress(')')
expression = Forward()('expression')
concatenation = Group(expression + expression)
concatenation.setResultsName('concatenation')
disjunction = Group(lparen + OneOrMore(expression + Suppress('|')) + expression + rparen)
disjunction.setResultsName('disjunction')
kleene = Group(lparen + expression + rparen + '*')
kleene.setResultsName('kleene')
expression << (number | disjunction | kleene | concatenation)
#Test a simple input
tests = """
(8)*((3|2)|2)
""".splitlines()[1:]
for t in tests:
    print(t)
    print(expression.parseString(t))
    print()
The result should be
[['8', '*'],[['3', '2'], '2']]
but instead, I only get
[['8', '*']]
How do I get pyparsing to parse the whole string?
parseString has a parameter parseAll. If you call parseString with parseAll=True you will get error messages if your grammar does not parse the whole string. Go from there!
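A minimal sketch of the difference, using a deliberately tiny grammar rather than the one from the question (the names number, partial and error_raised are assumptions for this example):

```python
from pyparsing import ParseException, Word, nums

number = Word(nums)

# Default behaviour: parsing stops quietly after the first match.
partial = number.parseString("8 + 2").asList()

# With parseAll=True, unparsed trailing text raises an exception instead.
try:
    number.parseString("8 + 2", parseAll=True)
    error_raised = False
except ParseException:
    error_raised = True
```

The exception reports where parsing stopped, which is usually the quickest pointer to the part of the grammar that needs work.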
Your concatenation expression is not doing what you want, and comes close to being left-recursive (fortunately it is the last term in your expression). Your grammar works if you instead do:
expression << OneOrMore(number | disjunction | kleene)
With this change, I get this result:
[['8', '*'], [['3', '2'], '2']]
EDIT:
You can also avoid the precedence problem (<< binds more tightly than |) by using the <<= operator instead:
expression <<= OneOrMore(number | disjunction | kleene)
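The precedence issue is a property of Python itself and can be shown without pyparsing. In this sketch, Node is a hypothetical stand-in class (not a pyparsing type) that just records what was attached to it:

```python
class Node:
    """Hypothetical stand-in that records what gets attached to it."""

    def __init__(self, label):
        self.label = label
        self.attached = None

    def __or__(self, other):
        # Build a new node representing the alternation of both sides.
        return Node(f"({self.label} | {other.label})")

    def __lshift__(self, other):
        self.attached = other.label
        return self

    def __ilshift__(self, other):
        self.attached = other.label
        return self


first = Node("expression")
first << Node("number") | Node("kleene")    # parsed as (first << number) | kleene

second = Node("expression")
second <<= Node("number") | Node("kleene")  # right-hand side is evaluated first
```

With <<, only the first alternative is attached and the rest is silently discarded; with <<=, the whole alternation is attached. Wrapping the right-hand side in parentheses, or in a single call such as OneOrMore(...), avoids the problem with << as well.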

Resources