Xhtml Invalid Characters? - xhtml

I have made a custom XHTML validator in .NET (validating through a DTD plus some extra rules), and I have noticed a discrepancy between my validation and W3C validation.
In my validator I get the following error when there is a colon in the id (for example, id="mustang:horse"):
(Error) The 'id' attribute has an invalid value according to its data type.
But I do not get any errors from the W3C validator for this pattern.
I tried to find a list of invalid characters for an attribute in XML/XHTML but couldn't find one.
Thank you for your help,

There is a list, and it does permit colons.
The XHTML 1.0 spec says at http://www.w3.org/TR/xhtml1/#h-4.10
... in XHTML 1.0 the id attribute is defined to be of type ID ...
The XML 1.0 spec says at http://www.w3.org/TR/2008/REC-xml-20081126/#id
Values of type ID MUST match the Name production.
And the Name production is defined at http://www.w3.org/TR/2008/REC-xml-20081126/#NT-Name
[4] NameStartChar ::= ":" |
[A-Z] | "_" | [a-z] | [#xC0-#xD6] |
[#xD8-#xF6] | [#xF8-#x2FF] |
[#x370-#x37D] | [#x37F-#x1FFF] |
[#x200C-#x200D] | [#x2070-#x218F] |
[#x2C00-#x2FEF] | [#x3001-#xD7FF] |
[#xF900-#xFDCF] | [#xFDF0-#xFFFD] |
[#x10000-#xEFFFF]
[4a] NameChar ::= NameStartChar | "-" | "." |
[0-9] | #xB7 | [#x0300-#x036F] |
[#x203F-#x2040]
[5] Name ::= NameStartChar (NameChar)*
And also says above this formal definition:
Document authors are encouraged to use
names which are meaningful words or
combinations of words in natural
languages, and to avoid symbolic or
white space characters in names. Note
that COLON, HYPHEN-MINUS, FULL STOP
(period), LOW LINE (underscore), and
MIDDLE DOT are explicitly permitted.
(My emphasis)

The reason for this difference is that the W3C validator doesn't seem to do namespace-aware XHTML processing. Although XHTML documents need to be in the XHTML namespace, this is actually reasonable: HTML documents do not use namespaces, and the normative valid structure of XHTML documents (as HTML) is defined by a DTD, and DTDs are not namespace aware.
As @Alochi already noted:
Values of type ID MUST match the Name
production.
This is true when the document is parsed without namespace awareness, but not when the document must be namespace conformant. The Namespaces in XML specification states that IDs must match the NCName production, which explicitly forbids the colon character. Namespace-aware parsing is a common convention, so using a colon in the value of an id is not recommended, even though it is allowed when parsing is not namespace aware.
Summary: if namespaces are ignored, an ID value must be a valid Name and it can contain a colon; otherwise it must be a valid NCName and it can't contain a colon.
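As a rough illustration of that summary, here is a minimal Python sketch of both checks. The character ranges are transcribed from the Name production quoted above; the function names is_name and is_ncname are made up for this example, and NCName is taken to be "a Name without a colon", as the Namespaces in XML specification defines it.

```python
import re

# Character ranges transcribed from the XML 1.0 NameStartChar production.
NAME_START = (
    ":A-Z_a-z"
    "\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u02FF"
    "\u0370-\u037D\u037F-\u1FFF"
    "\u200C-\u200D\u2070-\u218F"
    "\u2C00-\u2FEF\u3001-\uD7FF"
    "\uF900-\uFDCF\uFDF0-\uFFFD"
    "\U00010000-\U000EFFFF"
)
# NameChar adds "-", ".", digits, MIDDLE DOT and two combining/connector ranges.
NAME_CHAR = NAME_START + r"\-.0-9" + "\u00B7\u0300-\u036F\u203F-\u2040"

NAME_RE = re.compile(f"[{NAME_START}][{NAME_CHAR}]*")


def is_name(value: str) -> bool:
    """True if value matches the XML 1.0 Name production (colon allowed)."""
    return NAME_RE.fullmatch(value) is not None


def is_ncname(value: str) -> bool:
    """True if value is a Name that contains no colon (NCName)."""
    return is_name(value) and ":" not in value


print(is_name("mustang:horse"))    # valid ID when namespaces are ignored
print(is_ncname("mustang:horse"))  # rejected by a namespace-aware check
```

A validator that uses the second check, as the .NET one apparently does, will reject id="mustang:horse" while a DTD-only validator accepts it.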

Related

How to not escape route parameters when generating URL?

I have an Asset entity with a field called symbol. This field can basically contain any human-readable string, including special symbols.
I'd like to generate a URL with this symbol as a parameter, but without it being escaped.
For instance I have an Asset with symbol $, but it's being generated as assets/%24
I need to be able to generate it in the Twig template without escaping these characters.
I'm using Symfony 5.
$ is a reserved character, as specified in RFC 2396:
2.2. Reserved Characters
Many URI include components consisting of or delimited by, certain
special characters. These characters are called "reserved", since
their usage within the URI component is limited to their reserved
purpose. If the data for a URI component would conflict with the
reserved purpose, then the conflicting data must be escaped before
forming the URI.
reserved = ";" | "/" | "?" | ":" | "@" | "&" | "=" | "+" |
"$" | ","
If you don't mind not following this recommendation, you could URL-decode your generated URL by creating a Twig filter and using it like this:
{{ asset(...)|urldecode }}
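The escaping itself is not specific to Symfony or Twig. As a language-neutral illustration (a Python sketch using the standard urllib.parse module, not the Symfony API), this shows why $ becomes %24 and what decoding the generated URL does. The assets/ path is taken from the question; the variable names are made up.

```python
from urllib.parse import quote, unquote

symbol = "$"

# Default percent-encoding escapes reserved characters such as "$".
escaped_path = "assets/" + quote(symbol)

# Declaring "$" as safe leaves it untouched (the non-recommended variant).
raw_path = "assets/" + quote(symbol, safe="$")

# Decoding an already generated URL, which is what a urldecode filter would do.
decoded_path = unquote(escaped_path)

print(escaped_path)  # assets/%24
print(raw_path)      # assets/$
print(decoded_path)  # assets/$
```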

Are regexs allowed in BNF and EBNF notations?

If I wanted for example to define the Lisp programming language, where a name can include even non-alphanumeric characters, should I list all the usable characters with a notation like:
validchar ::= "a" | "b" | "c" ... "-" | "*" | "$" ... ;
name = validchar, (validchar | digit)+;
Or am I allowed to use regexs, like:
validchar ::= "[^(^)^\s^\d]";
name ::= validchar, (validchar | digit)*;
Or even:
name ::= "[^(^)^\s^\d]", "[^(^)^\s]"*;
This would shorten it a lot, and it would include even characters like ₩, ¥, € and so on, which I can't list but are actually usable.
Whether this is allowed depends on the tool you are using that implements the (E)BNF notation.
Some tools are rather strict and stick to the original definition of (E)BNF, allowing at best Kleene * or + on language tokens. An additional point is that there is no requirement for classic (E)BNF to operate on characters as terminals.
Clearly it is convenient to be able to define some language tokens directly in terms of characters, and one can imagine (as you have) an EBNF in which one can write not only characters as terminals, but also regexes over characters.
Whether the tool you propose to use allows that... depends entirely on the tool. Many tools that process (E)BNF, such as YACC, are actually designed to work in conjunction with another tool, a "lexer generator" (for YACC, this is LEX or its successor Flex), that defines character sequences for tokens. With such tool pairs, the (E)BNF tool typically does not allow any mention of characters or regexes over them, but the lexer generator tool explicitly does allow character and regex specifications for tokens.
There are hundreds of (E)BNF and lexer generator tools, each with somewhat (or egregiously) different rules. Check the tool documentation.
Or write it the way you want to write it, and build your own (101st) tool.
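To make the lexer/parser split concrete, here is a minimal Python sketch that is not tied to any real (E)BNF tool. Regexes define the tokens, and the grammar is then written over token kinds only. The names TOKEN_RE, tokenize and parse are made up for this example, and the NAME pattern is an assumption about what the question intended (any character except parentheses, whitespace and digits, followed by anything except parentheses and whitespace).

```python
import re

# Lexer side: tokens are described with regexes over characters.
TOKEN_RE = re.compile(r"""
    (?P<LPAREN>\()              |
    (?P<RPAREN>\))              |
    (?P<NUMBER>\d+)             |
    (?P<NAME>[^()\s\d][^()\s]*) |
    (?P<WS>\s+)
""", re.VERBOSE)


def tokenize(text):
    """Return (kind, text) pairs; every character belongs to some token."""
    return [(m.lastgroup, m.group())
            for m in TOKEN_RE.finditer(text)
            if m.lastgroup != "WS"]


def parse(tokens):
    """Parser side, over token kinds: expr ::= NAME | NUMBER | LPAREN expr* RPAREN.

    Assumes balanced parentheses; this sketch does no error recovery.
    """
    def expr(i):
        kind, text = tokens[i]
        if kind == "LPAREN":
            items, i = [], i + 1
            while tokens[i][0] != "RPAREN":
                node, i = expr(i)
                items.append(node)
            return items, i + 1
        return text, i + 1

    tree, _ = expr(0)
    return tree


print(parse(tokenize("(define x₩ 42)")))  # ['define', 'x₩', '42']
```

Note that the grammar in parse never mentions a character or a regex; characters such as ₩ are handled entirely by the NAME token pattern.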

Query String Delimiters

So I know you can separate the parameters in a query string with a couple of different characters
(e.g. www.example.com?foo=1&bar=2 or www.example.com?foo=1;bar=2)
Are there any characters other than ';' and '&' that can be used to separate query parameters? Is it just general coding practice to use ';' or '&' or are there some regulations that list which characters I can use? I know in RFC 3986 the reserved characters include
";" | "/" | "?" | ":" | "#" | "&" | "=" | "+" | "$" | ","
So does this mean that any of these characters can be used to separate query parameters?
The format of the query string contents isn't part of RFC 3986 (URIs) or RFC 723* (HTTP); it's a side effect of how HTML forms work.
So if your code needs to work with HTML forms, you are restricted to what browsers do. Otherwise, in theory, you can use any format you want, as long as it's consistent with RFC 3986's definition of the "query" component.
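A short Python sketch of that point: the standard library parses the form-style & convention, while any other delimiter is simply your own format that you must parse yourself. The helper parse_custom_query and the choice of "," as separator are assumptions for illustration only.

```python
from urllib.parse import parse_qs, unquote, urlsplit


def parse_custom_query(url, pair_sep=","):
    """Parse a query string that uses a custom pair separator.

    Hypothetical helper: the URI syntax only delimits the query component;
    how it is split into key/value pairs is an application convention.
    """
    query = urlsplit(url).query
    result = {}
    for pair in query.split(pair_sep):
        if not pair:
            continue
        key, _, value = pair.partition("=")
        result.setdefault(unquote(key), []).append(unquote(value))
    return result


# Form-style query string, as browsers produce it.
form_style = parse_qs(urlsplit("http://www.example.com/?foo=1&bar=2").query)

# Same data with a custom separator, parsed by our own convention.
custom_style = parse_custom_query("http://www.example.com/?foo=1,bar=2")

print(form_style)    # {'foo': ['1'], 'bar': ['2']}
print(custom_style)  # {'foo': ['1'], 'bar': ['2']}
```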

Avoiding left recursion in parsing LiveScript object definitions

I'm working on a parser for LiveScript language, and am having trouble with parsing both object property definition forms — key: value and (+|-)key — together. For example:
prop: "val"
+boolProp
-boolProp
prop2: val2
I have the key: value form working with this:
Expression ::= TestExpression
| ParenExpression
| OpExpression
| ObjDefExpression
| PropDefExpression
| LiteralExpression
| ReferenceExpression
PropDefExpression ::= Expression COLON Expression
ObjDefExpression ::= PropDefExpression (NEWLINE PropDefExpression)*
// ... other expressions
But however I try to add ("+"|"-") IDENTIFIER to PropDefExpression or ObjDefExpression, I get errors about using left recursion. What's the (right) way to do this?
The grammar fragment you posted is already left-recursive, i.e. without even adding (+|-)boolprop, the non-terminal 'Expression' derives a form in which 'Expression' reappears as the leftmost symbol:
Expression -> PropDefExpression -> Expression COLON Expression
And it's not just left-recursive, it's ambiguous. E.g.
Expression COLON Expression COLON Expression
can be derived in two different ways (roughly, left-associative vs right-associative).
You can eliminate both these problems by using something more restricted on the left of the colon, e.g.:
PropDefExpression ::= Identifier COLON Expression
Also, another ambiguity: Expression derives PropDefExpression in two different ways, directly and via ObjDefExpression. My guess is, you can drop the direct derivation.
Once you've taken care of those things, it seems to me you should be able to add (+|-)boolprop without errors (unless it conflicts with one of the other kinds of expression that you didn't show).
Mind you, looking at the examples at http://livescript.net, I'm doubtful how much of that you'll be able to capture in a conventional grammar. But if you're just going for a subset, you might be okay.
I don't know how much help this will be, because I know nothing about GrammarKit and not much more about the language you're trying to parse.
However, it seems to me that
PropDefExpression ::= Expression COLON Expression
is not quite accurate, and it is creating an ambiguity when you add the boolean property production because an Expression might start with a unary - operator. In the actual grammar, though, a property cannot start with an arbitrary Expression. There are two types of key-property definitions:
name : expression
parenthesized_expression : expression
(Which is to say, an expression used as a key needs to start with a "(".)
That means that a boolean property definition, starting with + or -, is recognizable from the first token, which is precisely the condition needed for successful recursive descent parsing. There are several other property definition syntaxes, including names and parenthesized_expressions not followed by a colon.
That's easy to parse with an LR(1) parser, like the one Jison produces, but to parse it with a recursive-descent parser you need to left-factor. (It's possible that GrammarKit can do this for you, by the way.) Basically, you'd need something like (this is not complete):
PropertyDefinition ::= PropertyPrefix PropertySuffix? | BooleanProperty
PropertyPrefix ::= NAME | ParenthesizedExpression
PropertySuffix ::= COLON Expression | DOT NAME
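To show what the left-factored rule buys you, here is a minimal recursive-descent sketch in Python (not GrammarKit syntax). It is deliberately incomplete: tokens are assumed to be pre-lexed (kind, text) pairs, Expression is reduced to a single token, and ParenthesizedExpression and the DOT suffix are left out. The function name and the result dictionaries are made up for this example.

```python
def parse_property(tokens):
    """PropertyDefinition ::= BooleanProperty | NAME (COLON Expression)?

    The first token alone decides which alternative applies, so no
    backtracking and no left recursion are needed.
    """
    kind, text = tokens[0]

    if kind in ("PLUS", "MINUS"):
        # BooleanProperty ::= (PLUS | MINUS) NAME
        if len(tokens) < 2 or tokens[1][0] != "NAME":
            raise SyntaxError("expected a name after + or -")
        return {"key": tokens[1][1], "value": kind == "PLUS"}

    if kind == "NAME":
        # The shared prefix is consumed once; then look for the optional suffix.
        if len(tokens) > 1 and tokens[1][0] == "COLON":
            if len(tokens) < 3:
                raise SyntaxError("expected an expression after ':'")
            return {"key": text, "value": tokens[2][1]}
        return {"key": text, "value": None}  # name without a suffix

    raise SyntaxError(f"unexpected token {text!r}")


print(parse_property([("NAME", "prop"), ("COLON", ":"), ("STRING", '"val"')]))
print(parse_property([("PLUS", "+"), ("NAME", "boolProp")]))
print(parse_property([("MINUS", "-"), ("NAME", "boolProp")]))
```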

What are the different kinds of cases?

I'm interested in the different kinds of identifier cases, and what people call them. Do you know of any additions to this list, or other alternative names?
myIdentifier : Camel case (e.g. in java variable names)
MyIdentifier : Capital camel case (e.g. in java class names)
my_identifier : Snake case (e.g. in python variable names)
my-identifier : Kebab case (e.g. in racket names)
myidentifier : Flat case (e.g. in java package names)
MY_IDENTIFIER : Upper case (e.g. in C constant names)
flatcase or mumblecase
kebab-case. Also called caterpillar-case, dash-case, hyphen-case, lisp-case, spinal-case and css-case
camelCase
PascalCase or CapitalCamelCase
snake_case or c_case
MACRO_CASE, UPPER_CASE or SCREAM_CASE
COBOL-CASE or TRAIN-CASE
Names are either generic, after a language, or colorful; most don’t have a standard name outside of a specific community.
There are many names for these naming conventions (names for names!); see Naming convention: Multiple-word identifiers, particularly for CamelCase (UpperCamelCase, lowerCamelCase). However, many don’t have a standard name. Consider the Python style guide PEP 0008 – it calls them by generic names like “lower_case_with_underscores”.
One convention is to name after a well-known use. This results in:
PascalCase
MACRO_CASE (C preprocessor macros)
…and suggests these names, which are not widely used:
c_case (used in K&R and in the standard library, like size_t)
lisp-case, css-case
COBOL-CASE
Alternatively, there are illustrative names, of which the best established is CamelCase. snake_case is more recent (2004), but is now well-established. kebab-case is yet more recent and still not established, and may have originated on Stack Overflow! (What's the name for dash-separated case?) There are many more colorful suggestions, like caterpillar-case, Train-case (initial capital), caravan-case, etc.
+--------------------------+-------------------------------------------------------------+
| Formatting               | Name(s)                                                     |
+--------------------------+-------------------------------------------------------------+
| namingidentifier         | flat case, Lazy Case                                        |
| NAMINGIDENTIFIER         | upper flat case                                             |
| namingIdentifier         | (lower) camelCase, dromedaryCase                            |
| NamingIdentifier         | (upper) CamelCase, PascalCase, StudlyCase, CapitalCamelCase |
| naming_identifier        | snake_case, pothole_case, C Case                            |
| Naming_Identifier        | Camel_Snake_Case                                            |
| NAMING_IDENTIFIER        | SCREAMING_SNAKE_CASE, MACRO_CASE, UPPER_CASE, CONSTANT_CASE |
| naming-identifier        | kebab-case, caterpillar-case, dash-case, hyphen-case,       |
|                          | lisp-case, spinal-case, css-case                            |
| NAMING-IDENTIFIER        | TRAIN-CASE, COBOL-CASE, SCREAMING-KEBAB-CASE                |
| Naming-Identifier        | Train-Case, HTTP-Header-Case                                |
| _namingIdentifier        | Underscore notation ("_" prefix followed by camelCase)      |
| datatypeNamingIdentifier | Hungarian notation (names prefixed with data-type metadata; |
|                          | now considered outdated)                                    |
+--------------------------+-------------------------------------------------------------+
MyVariable : Pascal Case => used for class names
myVariable : Camel Case => used for variables in Java, C#, etc.
myvariable : Flat Case => used for packages in Java, etc.
my_variable : Snake Case => used for variables in Python, PHP, etc.
my-variable : Kebab Case => used in CSS
The most common case types:
Camel case
Snake case
Kebab case
Pascal case
Upper case (with snake case)
camelCase
camelCase (1) starts with a lowercase letter and (2) capitalizes the first letter of every subsequent word, which is joined directly to the previous word.
An example of camel case of the variable camel case var is camelCaseVar.
snake_case
snake_case is as simple as replacing all spaces with a "_" and lowercasing all the words. It's possible to snake_case and mix camelCase and PascalCase but imo, that ultimately defeats the purpose.
An example of snake case of the variable snake case var is snake_case_var.
kebab-case
kebab-case is as simple as replacing all spaces with a "-" and lowercasing all the words. It's possible to kebab-case and mix camelCase and PascalCase but that ultimately defeats the purpose.
An example of kebab case of the variable kebab case var is kebab-case-var.
PascalCase
In PascalCase, every word starts with an uppercase letter (unlike camelCase, where the first word starts with a lowercase letter).
An example of pascal case of the variable pascal case var is PascalCaseVar.
Note: It's common to see this confused for camel case, but it's a separate case type altogether.
UPPER_CASE_SNAKE_CASE
UPPER_CASE_SNAKE_CASE is replacing all the spaces with a "_" and converting all the letters to capitals.
An example of upper case snake case of the variable upper case snake case var is UPPER_CASE_SNAKE_CASE_VAR.
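The rules above can be sketched as small Python helpers. This is only an illustration: the function names are made up, and the input is assumed to be a list of plain words already split on spaces.

```python
def to_camel(words):
    """First word lowercase, every later word capitalized, no separator."""
    return words[0].lower() + "".join(w.capitalize() for w in words[1:])


def to_pascal(words):
    """Every word capitalized, no separator."""
    return "".join(w.capitalize() for w in words)


def to_snake(words):
    """All lowercase, joined with underscores."""
    return "_".join(w.lower() for w in words)


def to_kebab(words):
    """All lowercase, joined with hyphens."""
    return "-".join(w.lower() for w in words)


def to_upper_snake(words):
    """All uppercase, joined with underscores."""
    return "_".join(w.upper() for w in words)


print(to_camel("camel case var".split()))         # camelCaseVar
print(to_pascal("pascal case var".split()))       # PascalCaseVar
print(to_snake("snake case var".split()))         # snake_case_var
print(to_kebab("kebab case var".split()))         # kebab-case-var
print(to_upper_snake("upper case snake case var".split()))
```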
For Python specifically, it is best to use snake_case for variable and function names, UPPER_CASE for constants (even though we don't have any keywords that specifically say that our variable is a constant) and PascalCase for class names.
camelCase is not recommended for Python (although languages such as JavaScript have it as their main casing), and kebab-case would be invalid, as Python names cannot contain a hyphen (-).
variable_name = 'Hello World!'

def function_name():
    pass

CONSTANT_NAME = 'Constant Hello World!!'

class ClassName:
    pass

Resources