How to create robust access logs using Apache Tomcat Valve Component? - http

We are working with Apache Tomcat 7 and trying to setup the Valve Component to store our access logs, ready for processing in SnowPlow.
The problem we have is how to make these logs robust. To give an example - we can separate fields with tabs and extract the user agent string like so:
pattern="%{yyyy-MM-dd}t %{hh:mm:ss}t %{User-Agent}i "
The problem is that the Valve Component does not (as far as I can see) escape %{User-Agent}i, so a stray tab in a useragent will corrupt the data (row will look like it contains four fields, not three).
As far as solutions, unless there's a way of escaping the useragent which I've missed, I can see a couple of solutions:
Use a really obscure field delimiter (or combination of field delimiters) which is very unlikely to crop up in a useragent string. We tried Ctrl-A (HTML ?) but that didn't seem to work
Write a custom AccessLogValve which either supports escaping or sanitizes tabs - perhaps similar to this post Sanitizing Tomcat access log entries
A bit puzzled that I can't find anything else about this online - does nobody parse their Tomcat access logs?
What do you recommend? We're a little stuck...

RFC2616 defines user agent string as
User-Agent = "User-Agent" ":" 1*( product | comment )
Then product is defined as
product = token ["/" product-version]
product-version = token
Following this, tokens are defined as
token = 1*<any CHAR except CTLs or separators>
and separators/CTLs as
separators = "(" | ")" | "<" | ">" | "#"
| "," | ";" | ":" | "\" | <">
| "/" | "[" | "]" | "?" | "="
| "{" | "}" | SP | HT
CTL = <any US-ASCII control character
(octets 0 - 31) and DEL (127)>
We need not to forget comment, which is defined as
comment = "(" *( ctext | quoted-pair | comment ) ")"
ctext = <any TEXT excluding "(" and ")">
quoted-pair = "\" CHAR
CHAR = <any US-ASCII character (octets 0 - 127)>
So if I understand correctly, you should be able to use any separator or CTL as long as you can distinguish comment, which is wrapped in ( and ). If ( appears inside the comment, it should be escaped with \.

In the end, I wrote a custom Tomcat AccessLogValve which:
Introduced a new pattern, 'I', to escape an incoming header
Introduced a new pattern, 'C', to fetch a cookie stored on the response
Re-implemented the pattern 'i' to ensure that "" (empty string) is replaced with "-"
Re-implemented the pattern 'q' to remove the "?" and ensure "" (empty string) is replaced with "-"
Overwrote the 'v' pattern, to write the version of this AccessLogValve, rather than the local server name
It seems to be pretty robust - I haven't had any further issues with unescaped values.

Related

(Maybe) Illegal character in ODBC SQL Server Connection String PWD=

According to what I have researched there are no illegal characters in the PWD= field of a SQL Server Connection String.
However, using SQL Server Express 2008 I changed the SA password to a GUID, specifically:
{85C86BD7-B15F-4C51-ADDA-3B6A50D89386}
So when connecting via ODBC I use this connection string:
"Driver={SQL Server};Server=.\\MyInstance;Database=Master;UID=SA;PWD={85C86BD7-B15F-4C51-ADDA-3B6A50D89386};"
But it comes back as Login failed for SA.
However, if I change the SA password to something just as long but without {}- it succeeds! Are there certain characters in PWD= that need to be escaped? I tried all different combinations with no luck.
As Microsoft's documentation states (emphasis added) --
Connection strings used by ODBC have the following syntax:
connection-string ::= empty-string[;] | attribute[;] | attribute; connection-string
empty-string ::=
attribute ::= attribute-keyword=[{]attribute-value[}]
attribute-value ::= character-string
attribute-keyword ::= identifier
Attribute values can optionally be enclosed in braces, and it is good practice to do so. This avoids problems when attribute values contain non-alphanumeric characters. The first closing brace in the value is assumed to terminate the value, so values cannot contain closing brace characters.
I would suggest you simply remove the braces when you set the password, and then the connect string you provided above should work fine.
ADDITION
I dug a bit further on Microsoft's site, and found some ABNF rules which may be relevant --
SC = %x3B ; Semicolon
LCB = %x7B ; Left curly brackets
RCB = %x7D ; Right curly brackets
EQ = %x3D ; Equal sign
ESCAPEDRCB = 2RCB ; Double right curly brackets
SpaceStr = *(SP) ; Any number (including 0) spaces
ODBCConnectionString = *(KeyValuePair SC) KeyValuePair [SC]
KeyValuePair = (Key EQ Value / SpaceStr)
Key = SpaceStr KeyName
KeyName = (nonSP-SC-EQ *nonEQ)
Value = (SpaceStr ValueFormat1 SpaceStr) / (ValueContent2)
ValueFormat1 = LCB ValueContent1 RCB
ValueContent1 = *(nonRCB / ESCAPEDRCB)
ValueContent2 = SpaceStr / SpaceStr (nonSP-LCB-SC) *nonSC
nonRCB = %x01-7C / %x7E- FFFF ; not "}"
nonSP-LCB-SC = %x01-1F / %x21-3A / %x3C-7A / %x7C- FFFF ; not space, "{" or ";"
nonSP-SC-EQ = %x01-1F / %x21-3A / %x3C / %x3E- FFFF ; not space, ";" or "="
nonEQ = %x01-3C / %x3E- FFFF ; not "="
nonSC = %x01-003A / %x3C- FFFF ; not ";"
...
ValueFormat1 is recommended to use when there is a need for Value to contain LCB, RCB, or EQ. ValueFormat1 MUST be used when the Value contains SC or starts with LCB.
ValueContent1 MUST be enclosed by LCB and RCB. Spaces before the enclosing LCB and after the enclosing RCB MUST be ignored.
ValueContent1 MUST be contained in ValueFormat1. If there is an RCB in the ValueContent1, it MUST use the two-character sequence ESCAPEDRCB to represent the one-character value RCB.
All of which comes down to... I believe the following connect string should work for you (note that there are 2 left/open braces and 3 right/close braces on the PWD value) --
"Driver={SQL Server};Server=.\\MyInstance;Database=Master;UID=SA;PWD={{85C86BD7-B15F-4C51-ADDA-3B6A50D89386}}};"
According to this page, the only legal "special character" in a name (I think they're talking about the DSN) is the UNDERSCORE:
The ODBC specification (and the SQL specification) states that names
must be in the format of " letter[digit | letter | _]...". The only
special character allowed is an underscore.
There was no reference to "the ODBC Specification". This page says it's the the ODBC 4.0 Spec.

Should packed cookies be treated as a single cookie?

I see some sites, like stackoverflow, use packed cookies where multiple cookies are packed into one. Here's an example:
Set-Cookie: acct=t=&s=; domain=.stackapps.com; expires=Mon, 30-May-2016 20:16:22 GMT; path=/; HttpOnly
Is this just to save sending multiple set-cookie headers, and to avoid sending comma separated cookies on the one set-cookie header? That's allowed--but is it not recommended?
Should the packed cookie just be treated as a single cookie, or does it need to be unpacked and sent back as individual cookies?
I do not know from where the idea of "packed" came about. Those are just cookies with the = sign in the value, or at least should be according to the specs. Let us go through the RFCs and see that:
Set-Cookie: acct=t=&s=; domain=.stackapps.com; expires=...
is exactly the same as
Set-Cookie: acct="t=&s="; domain=.stackapps.com; expires=...
Therefore, it is a single cookie and shall be treated as such.
The answer is rather long, sorry for that. I tried to aim it at people who find the grammar rules found in the RFCs difficult to understand. If you believe that some piece of the grammar is still difficult to understand please point it to me in a comment.
Through the RFCs
The current RFC for the Set-Cookie header is RFC6265, in section 4.1 it has the formal syntax for Set-Cookie:
set-cookie-header = "Set-Cookie:" SP set-cookie-string
set-cookie-string = cookie-pair *( ";" SP cookie-av )
cookie-pair = cookie-name "=" cookie-value
cookie-name = token
cookie-value = *cookie-octet / ( DQUOTE *cookie-octet DQUOTE )
cookie-octet = %x21 / %x23-2B / %x2D-3A / %x3C-5B / %x5D-7E
; US-ASCII characters excluding CTLs,
; whitespace DQUOTE, comma, semicolon,
; and backslash
token = <token, defined in [RFC2616], Section 2.2>
cookie-av = expires-av / max-age-av / domain-av /
path-av / secure-av / httponly-av /
extension-av
expires-av = "Expires=" sane-cookie-date
sane-cookie-date = <rfc1123-date, defined in [RFC2616], Section 3.3.1>
max-age-av = "Max-Age=" non-zero-digit *DIGIT
; In practice, both expires-av and max-age-av
; are limited to dates representable by the
; user agent.
non-zero-digit = %x31-39
; digits 1 through 9
domain-av = "Domain=" domain-value
domain-value = <subdomain>
; defined in [RFC1034], Section 3.5, as
; enhanced by [RFC1123], Section 2.1
path-av = "Path=" path-value
path-value = <any CHAR except CTLs or ";">
secure-av = "Secure"
httponly-av = "HttpOnly"
extension-av = <any CHAR except CTLs or ";">
That is a little terse but we do not need to got through it all. For a start we have the Set-Cookie: header and a space (SP), then the set-cookie-string which is defined further.
set-cookie-header = "Set-Cookie:" SP set-cookie-string
set-cookie-string is composed of a cookie-pair (defined further), which is the grammar part that interests us, and optionally a set of any number of cookie-av prefixed with ; and a space. The *() construct allows for any number of occurrences (including zero) of the grammar part.
set-cookie-string = cookie-pair *( ";" SP cookie-av )
cookie-av defines the metadata that can be used in the cookie but it is not needed for our proof, therefore we will abandon its discussion.
The cookie-pair on the other hand is a very simple construct: one cookie-name one mandatory = sign and one cookie-value.
cookie-pair = cookie-name "=" cookie-value
The cookie-name is defined as a token which leads us to another RFC, RFC2616. In the section 2.2 of that RFC we find the basic rules that define the token.
cookie-name = token
token = <token, defined in [RFC2616], Section 2.2>
token definition:
CTL = <any US-ASCII control character
(octets 0 - 31) and DEL (127)>
...
token = 1*<any CHAR except CTLs or separators>
separators = "(" | ")" | "<" | ">" | "#"
| "," | ";" | ":" | "\" | <">
| "/" | "[" | "]" | "?" | "="
| "{" | "}" | SP | HT
The 1*<> syntax means any number of occurrences but at least one occurrence. To find the CTLs use man ascii and check the Dec column, SP is space (as we already saw) and HT is the horizontal tab (9 in the ascii table).
The interesting part for us is the fact that a token cannot contain an = character.
Back to RFC6265:
cookie-pair = cookie-name "=" cookie-value
cookie-name stops at the first = character, that first = character is always the = explicit in the grammar. Now, let's finally define the cookie-value
cookie-value = *cookie-octet / ( DQUOTE *cookie-octet DQUOTE )
cookie-octet = %x21 / %x23-2B / %x2D-3A / %x3C-5B / %x5D-7E
; US-ASCII characters excluding CTLs,
; whitespace DQUOTE, comma, semicolon,
; and backslash
We already saw that the * there means any occurrences including zero (note that empty cookies are allowed by the RFC!). The interesting part the entire cookie-value can be enclosed by double quotes (DQUOTE is the double quote character as you might have guessed).
But the most interesting part is that the = sign (x3D in the ascii) table is allowed as a cookie-octet
/ %x3C-5B / <- right there!
Yet the space (x20) and semicolon (x3B) are disallowed.
Conclusion
Therefore this Set-Cookie header shall be interpreted as
Set-Cookie: acct=t=&s=; domain=.stackapps.com; expires=...
cookie-set-header = "Set-Cookie:" SP set-cookie-string
set-cookie-string = cookie-pair *(";" cookie-av)
cookie-pair = cookie-name "=" cookie-value
cookie-name = "acct"
cookie-value = "t=&s="
And the header sending it back to the server shall be
Cookie: acct=t=&s=
Sending it as follows violates the RFC:
Cookie: acct=t&; s=

Is it necessary to html encode right angle brackets?

I'm adding some meta description data to my header like so:
HtmlMeta meta = new HtmlMeta();
meta.Name = "description";
meta.Content = description; // this is unencoded
page.Header.Controls.Add(meta);
And .net helpfully encodes things like & and <, but not >. Now, I can't imagine that this would be an oversight, so I conclude that it's unnecessary to escape them. But before I go back to the client with that answer, it would be nice to get confirmation by Some Strangers From The Intarwebs first :)
According to the XML specification > is indeed valid for attributes. Only <, & and " or ' need escaping.
[10] AttValue ::= '"' ([^<&"] | Reference)* '"'
| "'" ([^<&'] | Reference)* "'"

How to encode the plus (+) symbol in a URL

The URL link below will open a new Google mail window. The problem I have is that Google replaces all the plus (+) signs in the email body with blank space. It looks like it only happens with the + sign. How can I remedy this? (I am working on a ASP.NET web page.)
https://mail.google.com/mail?view=cm&tf=0&to=someemail#somedomain.com&su=some subject&body=Hi there+Hello there
(In the body email, "Hi there+Hello there" will show up as "Hi there Hello there")
The + character has a special meaning in [the query segment of] a URL => it means whitespace: . If you want to use the literal + sign there, you need to URL encode it to %2b:
body=Hi+there%2bHello+there
Here's an example of how you could properly generate URLs in .NET:
var uriBuilder = new UriBuilder("https://mail.google.com/mail");
var values = HttpUtility.ParseQueryString(string.Empty);
values["view"] = "cm";
values["tf"] = "0";
values["to"] = "someemail#somedomain.com";
values["su"] = "some subject";
values["body"] = "Hi there+Hello there";
uriBuilder.Query = values.ToString();
Console.WriteLine(uriBuilder.ToString());
The result:
https://mail.google.com:443/mail?view=cm&tf=0&to=someemail%40somedomain.com&su=some+subject&body=Hi+there%2bHello+there
If you want a plus + symbol in the body you have to encode it as 2B.
For example:
Try this
In order to encode a + value using JavaScript, you can use the encodeURIComponent function.
Example:
var url = "+11";
var encoded_url = encodeURIComponent(url);
console.log(encoded_url)
It's safer to always percent-encode all characters except those defined as "unreserved" in RFC-3986.
unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
So, percent-encode the plus character and other special characters.
The problem that you are having with pluses is because, according to RFC-1866 (HTML 2.0 specification), paragraph 8.2.1. subparagraph 1., "The form field names and values are escaped: space characters are replaced by `+', and then reserved characters are escaped"). This way of encoding form data is also given in later HTML specifications, look for relevant paragraphs about application/x-www-form-urlencoded.
Just to add this to the list:
Uri.EscapeUriString("Hi there+Hello there") // Hi%20there+Hello%20there
Uri.EscapeDataString("Hi there+Hello there") // Hi%20there%2BHello%20there
See https://stackoverflow.com/a/34189188/98491
Usually you want to use EscapeDataString which does it right.
Generally if you use .NET API's - new Uri("someproto:with+plus").LocalPath or AbsolutePath will keep plus character in URL. (Same "someproto:with+plus" string)
but Uri.EscapeDataString("with+plus") will escape plus character and will produce "with%2Bplus".
Just to be consistent I would recommend to always escape plus character to "%2B" and use it everywhere - then no need to guess who thinks and what about your plus character.
I'm not sure why from escaped character '+' decoding would produce space character ' ' - but apparently it's the issue with some of components.

Is it possible to parse multi-line c style comments in MGrammar?

I've been hacking around with the May09 Oslo bits, experimented with tokenizing some source code. I can't seem to figure out how to correctly handle multiline C-style comments though.
For example: /*comment*/
Some cases that elude me:
/***/
or
/**//**/
I can make one or the other work, but not both.
The grammar was:
module Test {
language Comments {
token Comment =
MultiLineComment;
token MultiLineComment =
"/*" MultiLineCommentChar* "*/";
token MultiLineCommentChar =
^ "*" |
"*" PostAsteriskChar;
token PostAsteriskChar =
^ "*" |
"*" ^("*" | "/");
/*
token PostAsteriskChar =
^ "*" |
"*" PostAsteriskChar;
*/
syntax Main = Comment*;
}
}
The commented out token is what I think I want to do, however recursive tokens are not permitted.
The fact that MGrammar itself has "broken" multiline comments (it can't handle /***/) leads me to believe this isn't possible.
Does anyone know otherwise?
The way I have done it is as follows (not all my own code but I can't find a referance to the original author).
interleave Skippable = Whitespace | Comment;
interleave Comment = CommentToken;
#{Classification["Comment"]}
token CommentToken = CommentDelimited
| CommentLine;
token CommentDelimited = "/*" CommentDelimitedContent* "*/";
token CommentDelimitedContent
= ^('*')
| '*' ^('/');
token CommentLine = "//" CommentLineContent*;
token CommentLineContent
= ^(
'\u000A' // New Line
| '\u000D' // Carriage Return
| '\u0085' // Next Line
| '\u2028' // Line Separator
| '\u2029' // Paragraph Separator
);
This allows for both single line (//) comments as well as multiline (/* */) comments.

Resources