Is it possible to parse multi-line C-style comments in MGrammar?

I've been hacking around with the May09 Oslo bits, experimenting with tokenizing some source code. I can't seem to figure out how to correctly handle multi-line C-style comments, though.
For example: /*comment*/
Some cases that elude me:
/***/
or
/**//**/
I can make one or the other work, but not both.
The grammar was:
module Test {
    language Comments {
        token Comment =
            MultiLineComment;
        token MultiLineComment =
            "/*" MultiLineCommentChar* "*/";
        token MultiLineCommentChar =
            ^ "*" |
            "*" PostAsteriskChar;
        token PostAsteriskChar =
            ^ "*" |
            "*" ^("*" | "/");
        /*
        token PostAsteriskChar =
            ^ "*" |
            "*" PostAsteriskChar;
        */
        syntax Main = Comment*;
    }
}
The commented-out token is what I think I want to do; however, recursive tokens are not permitted.
The fact that MGrammar itself has "broken" multiline comments (it can't handle /***/) leads me to believe this isn't possible.
Does anyone know otherwise?

The way I have done it is as follows (not all my own code, but I can't find a reference to the original author).
interleave Skippable = Whitespace | Comment;
interleave Comment = CommentToken;
#{Classification["Comment"]}
token CommentToken = CommentDelimited
| CommentLine;
// The trailing '*'* lets a run of stars sit directly before the closing "*/" (so /***/ lexes),
// while the content rule never consumes a star run that is immediately followed by '/'.
token CommentDelimited = "/*" CommentDelimitedContent* '*'* "*/";
token CommentDelimitedContent
    = ^('*')
    | '*' '*'* ^('*' | '/');
token CommentLine = "//" CommentLineContent*;
token CommentLineContent
    = ^(
          '\u000A' // New Line
        | '\u000D' // Carriage Return
        | '\u0085' // Next Line
        | '\u2028' // Line Separator
        | '\u2029' // Paragraph Separator
      );
This handles both single-line (//) comments and multi-line (/* */) comments.
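
Since the delimited-comment token is just a regular language, you can sanity-check its shape outside M. Here is a quick Go sketch (my own cross-check, not part of the original answer) using the classic C-comment regular expression:

package main

import (
    "fmt"
    "regexp"
)

// Classic C-comment pattern: "/*", then runs of non-stars or runs of stars
// not followed by '/', then the closing run of stars and '/'.
var comment = regexp.MustCompile(`/\*([^*]|\*+[^*/])*\*+/`)

func main() {
    fmt.Println(comment.FindAllString("/***/", -1))    // [/***/]
    fmt.Println(comment.FindAllString("/**//**/", -1)) // [/**/ /**/]
}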

Related

How to get a part of a URL? (nginx, rewrite, lua)

How do I get part of a URL?
For example, given the URL /aaa/bbb/ccc?test=123, I need to get the /ccc part into an nginx variable, i.e. a variable with the value "ccc".
I will use the variable in a location {} block.
url = '/abc/def/ghi?test=123'

-- find the '?', then keep everything before it
q = url :find('?')
head = url :sub( 1, q -1 ) -- /abc/def/ghi

-- reversing is just an easy way to grab what would
-- be the last delimiter, but is now the first.
head = head :reverse() -- ihg/fed/cba/
slash = head :find('/') -- position of the 1st '/'

-- same as above: grab from the beginning to the found position -1
folder = head :sub( 1, slash -1 ) -- ihg
-- reverse again, so it's right-way 'round again.
folder = folder :reverse() -- ghi
Commands can be stacked to make it smaller. It's harder to explain the logic that way, but it's the same thing.
head = url :sub( 1, url :find('?') -1 ) :reverse()
folder = head :sub( 1, head :find('/') -1 ) :reverse()
I would suggest using a library, available from the official OpenResty package manager, OPM:
opm install bungle/lua-resty-reqargs
This library works for both GET and POST variables:
local get, post, files = require "resty.reqargs"()
if not get then
    local ErrorMessage = post
    error(ErrorMessage)
end
-- Here you basically have `get', which is a table with:
-- { test = 123 }
The API is a little bit weird because the function returns three values.
In case of failure, it returns false, ErrorMessage, false.
In case of success, it returns GetTable, PostTable, FilesTable. Apart from that, it just works.
You can use it in a location block:
location /aaa/bbb/ccc {
    content_by_lua_block {
        local get, post, files = require "resty.reqargs"()
        if not get then
            local ErrorMessage = post
            ngx.say(ErrorMessage)
        else
            ngx.say(string.format("test=%d", get.test))
        end
    }
}

One HTTP Delimiter to Rule Them All

I have a configuration file in the format of blah = foo. I would like to have entries like:
http = https://stackoverflow.com/questions,header keys and values,string to search for.
I'm okay with requiring that the URL be URL-encoded. Is there any ASCII character I can use that won't be a valid value anywhere in the above example (after splitting once on =)? My example uses a comma, but I think that is valid in a header value?
After poring through some RFCs, I figure someone who is more familiar with this can save me some pain.
Also, my project is in Go, in case there is an existing standard library package that might help with this...
You can use a non-ASCII character and URL-encode the values, for example using the middle dot (Compose + ^ + . on Linux):
const sep = `·`

const t = `http = https://stackoverflow.com/questions·string to search for·header=value·header=value`

func parseLine(line string) (name, url, search string, headers []string) {
    idx := strings.Index(line, " = ")
    if idx == -1 {
        return
    }
    name = line[:idx]
    parts := strings.Split(line[idx+3:], sep)
    if len(parts) < 3 {
        // handle invalid line
        return
    }
    url, search = parts[0], parts[1]
    headers = parts[2:]
    return
}
That said, using JSON is probably the best and most maintainable long-term option.
For completeness' sake, a JSON version would look like:
type Site struct {
    Url     string
    Query   string
    Headers map[string]string
}

const t = `[
    {
        "url": "https://stackoverflow.com/questions",
        "query": "string to search for",
        "headers": {"header": "value", "header2": "value"}
    },
    {
        "url": "https://google.com",
        "query": "string to search for",
        "headers": {"header": "value", "header2": "value"}
    }
]`

func main() {
    var sites []Site
    err := json.Unmarshal([]byte(t), &sites)
    fmt.Printf("%+v (%v)\n", sites, err)
}
Essentially you have to look at RFC 3986, RFC 7230, and friends to see what can occur.
URIs are simple if you insist that they be valid; just use the space character or "<" and ">" as delimiters.
Field values can be almost anything; HTTP forbids control characters, though, so you might be able to use horizontal tabs (if you're OK with getting into trouble over invalid field values).
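
To sketch what the tab-delimited option could look like in Go (my own illustration; parseTabLine and the field layout are invented for the example, not taken from the question):

package main

import (
    "fmt"
    "strings"
)

// parseTabLine assumes the value part of a "blah = foo" entry uses a
// horizontal tab between the URL, the search string, and any header
// key/value pairs, relying on HTTP forbidding control characters in
// field values.
func parseTabLine(line string) (url, search string, headers []string, err error) {
    idx := strings.Index(line, " = ")
    if idx == -1 {
        return "", "", nil, fmt.Errorf("no %q separator in line", " = ")
    }
    parts := strings.Split(line[idx+3:], "\t")
    if len(parts) < 2 {
        return "", "", nil, fmt.Errorf("want at least a URL and a search string, got %d fields", len(parts))
    }
    return parts[0], parts[1], parts[2:], nil
}

func main() {
    line := "http = https://stackoverflow.com/questions\tstring to search for\tX-Header: value"
    url, search, headers, err := parseTabLine(line)
    fmt.Println(url, search, headers, err)
}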

Lexical error at line 0, column 0

In the grammar below, I am trying to configure any line that starts with ' as a single-line comment, and anything between /' and '/ as a multi-line comment. The single-line comment works OK. But for some reason, as soon as I press /, ', ;, < or >, I get the errors below. I don't have the above characters configured. Shouldn't they be treated as default and skipped during parsing?
Error
Lexical error at line 0, column 0. Encountered: "\"" (34), after : ""
Lexical error at line 0, column 0. Encountered: ">" (62), after : ""
Lexical error at line 0, column 0. Encountered: "\n" (10), after : "-"
I have only included part of the code below for conciseness. For the full lexer definition, please see the link.
TOKEN :
{
  < WHITESPACE:
      " "
    | "\t"
    | "\n"
    | "\r"
    | "\f" >
}

/* COMMENTS */

MORE :
{
  < "/'" > { input_stream.backup(1); } : IN_MULTI_LINE_COMMENT
}

<IN_MULTI_LINE_COMMENT>
TOKEN :
{
  < MULTI_LINE_COMMENT: "'/" > : DEFAULT
}

<IN_MULTI_LINE_COMMENT>
MORE :
{
  < ~[] >
}

TOKEN :
{
  < SINGLE_LINE_COMMENT: "'" (~["\n", "\r"])* ("\n" | "\r" | "\r\n")? >
}
I can't reproduce every aspect of your problem. You say there is an error "as soon as" you enter certain characters. Here is what I get.
/ There is no error if the next character is a '. If the next character is not ', there is an error.
' I see no error. This is correctly treated as the start of a comment.
; There is always an error. No token can start with ;.
< There is only an error if the next characters are not - or <-.
> There is always an error. No token can start with >.
I'm not exactly sure why you would expect these not to be errors, since your lexer has no rules to cover these cases. Generally, when there is no rule to match a prefix of the input and the input is not exhausted, a TokenMgrError is thrown.
If you want to eliminate all these TokenMgrErrors, make a catch-all rule (as explained in the FAQ):
TOKEN: { <UNEXPECTED_CHARACTER: ~[] > }
Make sure this is the very last rule in the .jj file. This rule says that, when no other rule applies, the next character is treated as an UNEXPECTED_CHARACTER token. Of course, this just pushes the problem up to the parsing level. If you really want the tokenizer to skip all characters that don't belong, use the following rule as the very last rule:
SKIP : { < ~[] > }
For most languages, that would be an odd thing to do, which is why it is not the default.

How to create robust access logs using Apache Tomcat Valve Component?

We are working with Apache Tomcat 7 and trying to set up the Valve Component to store our access logs, ready for processing in SnowPlow.
The problem we have is how to make these logs robust. To give an example: we can separate fields with tabs and extract the user agent string like so:
pattern="%{yyyy-MM-dd}t %{hh:mm:ss}t %{User-Agent}i "
The problem is that the Valve Component does not (as far as I can see) escape %{User-Agent}i, so a stray tab in a user agent will corrupt the data (the row will look like it contains four fields, not three).
As far as solutions go, unless there's a way of escaping the user agent which I've missed, I can see a couple of options:
Use a really obscure field delimiter (or combination of field delimiters) which is very unlikely to crop up in a user agent string. We tried Ctrl-A (HTML ?) but that didn't seem to work.
Write a custom AccessLogValve which either supports escaping or sanitizes tabs, perhaps similar to this post: Sanitizing Tomcat access log entries.
I'm a bit puzzled that I can't find anything else about this online. Does nobody parse their Tomcat access logs?
What do you recommend? We're a little stuck...
RFC 2616 defines the User-Agent string as
User-Agent = "User-Agent" ":" 1*( product | comment )
Then product is defined as
product = token ["/" product-version]
product-version = token
Following this, tokens are defined as
token = 1*<any CHAR except CTLs or separators>
and separators/CTLs as
separators = "(" | ")" | "<" | ">" | "@"
| "," | ";" | ":" | "\" | <">
| "/" | "[" | "]" | "?" | "="
| "{" | "}" | SP | HT
CTL = <any US-ASCII control character
(octets 0 - 31) and DEL (127)>
We must not forget comment, which is defined as
comment = "(" *( ctext | quoted-pair | comment ) ")"
ctext = <any TEXT excluding "(" and ")">
quoted-pair = "\" CHAR
CHAR = <any US-ASCII character (octets 0 - 127)>
So if I understand correctly, you should be able to use any separator or CTL as long as you can distinguish comments, which are wrapped in ( and ). If ( appears inside a comment, it should be escaped with \.
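
To make the separator/CTL point concrete, here is a small helper, written in Go purely as my own illustration (not something from this answer or from Tomcat), that reports whether a candidate delimiter byte can ever appear inside an RFC 2616 token; comments wrapped in parentheses remain the separate case you still have to handle:

package main

import "fmt"

// Separator characters from RFC 2616; these, plus the CTLs
// (octets 0-31 and 127), can never appear inside a token.
const rfc2616Separators = "()<>@,;:\\\"/[]?={} \t"

// neverInToken reports whether c is a CTL or a separator, i.e. a byte
// that cannot occur in a product token. It says nothing about comments
// "(...)", which may contain most TEXT characters.
func neverInToken(c byte) bool {
    if c <= 31 || c == 127 { // CTL
        return true
    }
    for i := 0; i < len(rfc2616Separators); i++ {
        if rfc2616Separators[i] == c {
            return true
        }
    }
    return false
}

func main() {
    fmt.Println(neverInToken('\t')) // true: HT is a separator
    fmt.Println(neverInToken('x'))  // false: an ordinary token character
}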
In the end, I wrote a custom Tomcat AccessLogValve which:
Introduced a new pattern, 'I', to escape an incoming header
Introduced a new pattern, 'C', to fetch a cookie stored on the response
Re-implemented the pattern 'i' to ensure that "" (empty string) is replaced with "-"
Re-implemented the pattern 'q' to remove the "?" and ensure "" (empty string) is replaced with "-"
Overrode the 'v' pattern to write the version of this AccessLogValve, rather than the local server name
It seems to be pretty robust - I haven't had any further issues with unescaped values.

Is it necessary to html encode right angle brackets?

I'm adding some meta description data to my header like so:
HtmlMeta meta = new HtmlMeta();
meta.Name = "description";
meta.Content = description; // this is unencoded
page.Header.Controls.Add(meta);
And .NET helpfully encodes things like & and <, but not >. Now, I can't imagine that this would be an oversight, so I conclude that it's unnecessary to escape them. But before I go back to the client with that answer, it would be nice to get confirmation from Some Strangers From The Intarwebs first :)
According to the XML specification, > is indeed valid in attribute values. Only <, & and the quote character (" or ') need escaping.
[10] AttValue ::= '"' ([^<&"] | Reference)* '"'
| "'" ([^<&'] | Reference)* "'"
