Allowed syntax of comments within CSS selectors - css

In CSS, often it improves code readability to not wrap long lines, but instead ensure all lines fit horizontally within the code editor's viewport.
But whitespace is very significant in CSS, often making this goal challenging.
Thus, sometimes embedding a comment within a long CSS selector (in a CSS ruleset) is a reasonable choice.
I have found that selectors like:
div./*
*/class
work reliably, but selectors like:
div/*
*/.class
are less supported. At least, I'm getting an error in Stylish when using CSS Lint with this one.
Are either (or both) of these technically valid, and if so, where in the RFC is this indicated?

Before I answer your question, I want to assure you that whitespace is not as much of a problem in CSS selectors as you may think, and that it is actually insignificant most of the time. There are very few places where it is significant, and only one of them you encounter in everyday use: descendant combinators. And even then you can still use a line break in place of a space and it'll still be parsed as a descendant combinator. There's only one situation I can think of where you'd need a comment, and that is if the identifiers (class, ID, attribute, etc) in your compound selectors are getting too long and you want to break up your compound selectors. That's probably a sign of issues out of your control though, so I won't judge. Now let's get to your question.
These specific examples aren't documented in the spec. To answer your question upfront: they are both valid. To understand why, you'll need to understand how tokenization works in CSS, which is covered in a specification called css-syntax. Thankfully, one crucial thing CSS has in common with many other languages whose comments have start and end delimiters, is that if a comment is sitting cleanly between two distinct tokens and neither is being broken up, then those two tokens will parse exactly the same as if the comment wasn't there.
But how CSS is tokenized can be a bit of a surprise. One might assume that a class selector such as .class would be considered a single token, based on the Selectors grammar, and therefore a comment anywhere within it would break it and cause a parse error:
<class-selector> = '.' <ident-token>
However, <class-selector> is a production that consists of two tokens: the dot, which is considered a <delim-token>, followed by an <ident-token>. Since the dot exists as a separate token from the ident that would form the class name, a comment may exist cleanly between both tokens (./**/class) while still allowing this to be parsed as a valid class selector.
This applies to class selectors, pseudo-classes (:nth-child()) and pseudo-elements (::first-letter). However, it does not apply to ID selectors, because an ID selector is actually a single <hash-token> (think hex color values). Nor can a comment appear before the ( of a functional notation, because the ( is part of the <function-token>, or next to a hyphen within an ident, because the hyphen is part of the ident.
Having said that, a comment sitting between two characters doesn't cause a parse error right away if the resulting two tokens can still parse. But context matters. Here's an example:
.cla/**/ss
This gets parsed into the following tokens:
<delim-token> '.'
<ident-token> 'cla'
<comment-token> (empty)
<ident-token> 'ss'
This isn't an error in and of itself, because if we forget the dot for a moment then we really just have two idents with a comment between them, and such cases are valid CSS anywhere you may have two or more idents otherwise separated by whitespace, like border: thin/**/dashed being equivalent to border: thin dashed.
But this becomes an error in Selectors because the Selectors grammar doesn't allow two consecutive idents in that context (there's a limited number of places where it's allowed such as unquoted attribute selectors with an i/s flag).
As for div/**/.class, since div and .class are two distinct productions (a <type-selector> followed by a <class-selector>), a comment sitting cleanly between them won't have any effect on parsing, so this'll still be parsed as a compound selector without a descendant combinator.
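To make the tokenization argument concrete, here is a rough Python sketch of a toy tokenizer. It is not the css-syntax algorithm: the token names only loosely mirror the spec, the regular expression ignores escapes, numbers, strings and functions, and the helper names tokenize and significant are made up for this illustration. It only shows that a comment sitting between two tokens leaves the token stream unchanged, while a comment inside an ident or a hash splits it.

```python
import re

# Toy approximation of a few css-syntax token types (assumption: no escapes,
# numbers, strings or functions are handled here).
TOKEN_RE = re.compile(r"""
    (?P<comment>/\*.*?\*/)          # /* ... */, may span line breaks
  | (?P<whitespace>\s+)
  | (?P<hash>\#[\w-]+)              # e.g. #id, a single token
  | (?P<ident>-?[A-Za-z_][\w-]*)    # e.g. div, class, first-letter
  | (?P<delim>.)                    # any other single character, e.g. "."
""", re.VERBOSE | re.DOTALL)


def tokenize(selector):
    """Return (token type, text) pairs for every token, comments included."""
    return [(m.lastgroup, m.group()) for m in TOKEN_RE.finditer(selector)]


def significant(selector):
    """Return the tokens with comments dropped, as a parser would see them."""
    return [tok for tok in tokenize(selector) if tok[0] != "comment"]


if __name__ == "__main__":
    for sel in ["div.class", "div/*\n*/.class", "div./*\n*/class", ".cla/**/ss", "#i/**/d"]:
        print(repr(sel), significant(sel))
```

Under these assumptions, both selectors from the question yield the same three tokens as div.class, whereas .cla/**/ss yields two idents after the dot and #i/**/d yields a hash followed by a stray ident.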
The only browsers that I know have trouble parsing selectors with comments inside them are IE8 and older. This fact has been exploited over the years to produce reliable selector hacks. If you really have to use comments to hide line breaks that would otherwise break your selectors (because you've run out of places you could substitute regular line breaks), I'd recommend using them to separate entire simple selectors rather than delimiters from names because it's a little bit more readable that way. Nevertheless, the Selectors level 4 spec helpfully provides a list of places where whitespace isn't allowed within a selector and you can therefore substitute a comment in a way CSS Lint has evidently failed to account for:
White space is forbidden:
Between any of the top-level components of a <compound-selector> (that is, forbidden between the <type-selector> and <subclass-selector>, or between the <subclass-selector> and <pseudo-element-selector>, etc).
Between any of the components of a <type-selector> or a <class-selector>.
Between the ':'s, or between the ':' and <ident-token> or <function-token>, of a <pseudo-element-selector> or a <pseudo-class-selector>.
Between any of the components of a <wq-name>.
Between the components of an <attr-matcher>.
Between the components of a <combinator>.
Note that whitespace (and therefore line breaks) is allowed in most parts of an attribute selector, so the use of comments is unnecessary. Note also that the one exception to this list is <attr-matcher>, which appears to be a single token rather than two <delim-token>s. I can't find this documented anywhere.
Again, I really can't imagine having to do this, but hey, at least you learned something about CSS tokenization, right?

Related

Were there any technical limitations within CSS that led to the decision of using `--` for vars?

With the introduction of CSS vars, I would like to know the reasoning for why -- was chosen as the way of denoting a var.
Considering CSS has a semi-capable calc function, I feel like the -- could easily be confused for the decrement operator in other languages.
I am curious if there is any historical significance or technical limitation that led to choosing --. The double in particular perplexes me, when CSS markers are generally singles (#, ., @, etc). Using a symbol that is already used by other things is also interesting (especially when it's valid for a class name to begin with --).
Example:
@custom-media --lt-sm (width < 576px);
--grey300: #e0e0e0;
.navbarItem {
  display: inline-block;
  text-align: center;
  border-top: 1px solid var(--grey300);
  border-left: 1px solid var(--grey300);
  @media (--lt-sm) {
    flex-grow: 1;
  }
  &:last-child {
    border-right: 1px solid var(--grey300);
  }
}
Disclaimer
Some might argue the validity of this question, but understanding the why is a key technique to remembering a particular concept.
The only discussion I can find related to it:
In the telcon today, we resolved to use a "--" prefix to indicate
custom properties and other custom things.
We discussed whether the prefix is maintained or dropped when
referring to the custom property from a var() function, but didn't
actually resolve. Discussion in the call leaned toward dropping the
prefix, like I do currently with var-* properties, but some side
discussion with Simon and Sylvain argued for using the full name, as
there are confusing cases like "--0" or "----".
So, while I understand the potential confusion caused by authors
possibly thinking that var() can take any property name as an
argument, I think it's overruled by the confusion over what needs to
be escaped in various circumstances. Escaping rules are always very
confusing to authors, so I'm going to go with "use the custom property
name literally as the var() argument".
Reference:
http://lists.w3.org/Archives/Public/www-style/2014Mar/0467.html
https://www.w3.org/TR/css-variables/#defining-variables
On top of avoiding clashing with vendor prefixes that begin with a single dash as mentioned by the other answers, keep in mind that the grammar for a CSS ident doesn't allow anything other than letters, numbers, dashes and underscores (see section 4.1.3 of CSS2). Were a custom property to be denoted by any other symbol, every existing CSS implementation would need to update or even rewrite its parser just to accommodate whichever symbol was being used for custom property names.1
From the minutes of the telecon that's alluded to in the message, you can see that the hypothesis of avoiding clashing with vendor prefixes that begin with a single dash is true. The use of -- did require a minor parser change, because idents normally cannot start with double dashes (as klumme pointed out). I'm not sure exactly how existing implementations parse(d) declarations, but it's safe to assume that my reasoning still holds: since idents can start with one dash, consuming the first dash would not immediately result in a parse error, so a parser can determine if it's looking at a <property> or a <custom-property-name> (or then encounter a parse error if it doesn't support custom props) before deciding how it should proceed.
This change is also reflected in the css-variables spec, in section 2 (which I also cover here):
A custom property is any property whose name starts with two dashes (U+002D HYPHEN-MINUS), like --foo. The <custom-property-name>
production corresponds to this: it’s defined as any valid identifier that starts with two dashes.
So dashes are used because they fit into the existing grammar with only minor parser changes, and double in particular to prevent clashing with vendor prefixes (note that dashes and underscores are interchangeable for vendor prefixes, so a single underscore wouldn't cut it either).
1 As a matter of fact, I gave this very same reasoning in response to somebody else's question just a few weeks ago, although their question wasn't about the chosen prefix for custom prop names.
I think that it was potentially related in some way to how vendors have their own prefix in which hyphen denoted attributes are ignored if they're not understood by the browser/parser. I think. It's also one way that wouldn't clash with any other CSS function or object that already exists.
For example, preprocessors like LESS use the @ symbol whereas SASS uses the $ symbol. This was discussed over at the Software Engineering SE, which has a pretty decent answer to go with it.
Essentially, it seems that it was for compatibility reasons above anything else. One could argue that it's also easy to denote what is a variable compared to other things like the & symbol and % symbol.
Here is some additional reading on the differences of CSS variables over on CSS Tricks which includes vanilla CSS, LESS and SASS etc.
My thinking is it has to do with backwards compatibility. When you add something to CSS, not only do you have to make sure the syntax can be parsed unambiguously, you have to make sure that the new feature will not affect how old browsers parse the parts of the stylesheet that they are actually able to understand.
As variables can be declared inside a rule set, one hyphen could cause confusion with vendor prefixes. And the class name in a class selector is actually not allowed to start with two hyphens, at least in the CSS 2.1 spec.

How can I use/emulate regex-like backreferences in attribute selectors?

What I want to do is write a selector that matches an arbitrary value in one place, then later requires a different value be equal to it. If [attr="value"] parsed "value" as a regex, then this would solve my problem:
*[class="(.+)"] *[class="if_\1"] {/* styles */}
Obviously I could just list each possible class individually, but that takes a great deal of convenience out of it.
Is this possible?
No, it's not possible. Attribute selectors are almost completely static and provide almost no dynamic matching functionality (beyond that of substring matches, which are still static and not dynamic pattern-based). They do not support anything like what you see in everyday regular expressions.
A stylesheet preprocessor such as Sass or LESS will allow you to generate the static CSS rules needed, but all that does is automate the manual task of listing all possible values individually, which proves my first point.

How to find everything that is not matching a regular expression

I would like to search thousands of HTML files for bad practices involving height, width, or any other CSS.
For instance, I would like to find all places where height is given without units: height:40 should be found, but height:40px shouldn't.
For that I am using the search program Agent Ransack, in which I can use a regular expression to search within files.
Currently my regular expression is:
(height:)[\s]*[0-9]*\.?[0-9]+(px)
This finds everything that looks like height:40px. (Later on I want to add width and other properties.)
My question is how to make a NOT on top of all that?
Or is there any other good application to search files for regular expressions?
Use a negative lookahead (?!regex), eg:
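height:\s*\d+(?:\.\d+)?(?!px|[\d.])
The \d and . in the lookahead are needed to stop backtracking from producing a shorter match (for example, height:4 inside height:40px, or height:40 inside height:40.5px).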
Consider using a program that already does this, and add your own rules. In general, the practice of checking source code for such problems is called 'linting'. A quick search turns up something like CSSLint, which is open source and allows custom rules:
https://github.com/stubbornella/csslint/wiki/Rules
http://csslint.net/
Use the negative lookahead this way:
(?!height:\s*\d+(?:\.\d+)?px)height:\s*\d+(?:\.\d+)?
Considering you are using css, you may want to include other units as valid ones like pt, em, %, etc... like the below regex
(?!height:\s*\d+(?:\.\d+)?(?:px|pt|em|%|cm|mm|in|ex|pc))height:\s*\d+(?:\.\d+)?
You can test it over Rubular
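As a quick way to try the leading-lookahead pattern outside a search tool, here is a minimal Python sketch. The function name find_unitless_heights and the sample string are made up for illustration; the unit list is the one from the regex above and is not exhaustive.

```python
import re

# Units treated as valid, copied from the regex above (not a complete list).
UNITS = r"(?:px|pt|em|%|cm|mm|in|ex|pc)"
NUMBER = r"\d+(?:\.\d+)?"

# The lookahead rejects any position where "height: <number><unit>" matches;
# otherwise the rest of the pattern consumes "height: <number>".
MISSING_UNIT = re.compile(rf"(?!height:\s*{NUMBER}{UNITS})height:\s*{NUMBER}")


def find_unitless_heights(text):
    """Return every 'height:<number>' occurrence that has no recognized unit."""
    return [m.group() for m in MISSING_UNIT.finditer(text)]


if __name__ == "__main__":
    sample = "a{height:40} b{height:40px} c{height: 1.5em} d{height:12.5}"
    print(find_unitless_heights(sample))
```

Note that this sketch also flags height:0 and unitless line-height values, since the pattern only looks for the text height: followed by a number; a real rule would probably want to exclude those.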

CSS Mnemonics: How do you remember whether # or . is for class or id?

#test is the selector for id="test"
.test is the selector for class="test"
but how do you remember which way round they are (eg not .=id)
Well, in truth these things are so common that most people don't need mnemonics to remember them, but here's something I came up with, if it helps:
In terms of a filename a . and then an extension denotes a type of thing. There can be many different things of this type. With CSS, using classes you can denote a single style for many elements of the same type.
In terms of a URL, a # denotes an anchor link to a specific spot in the document. It refers to one location only. With CSS, using IDs you denote a single style for a single specific element.
If a police officer catches you with "hash," he will ask to see your ID. If not you get to stay classy. It's really dumb, but that's how I remember.
CallerID shows a phone #.
Periods are round like pearls, and-- Pearls are classy.
(P.S.: What's with all the "you learn differently than me, so you suck" comments? Goodness. Repetition is OK for me, but if I can visualize something I pick things up more quickly. In fact, the weirder something is the easier it is to memorize!)
I learned it the same way I learned that quotes (rather than parentheses) are used for attributes' values — by typing them a couple of times.
If you or someone you know gets tripped up by # vs ., though, consider that many programming languages use a . to access the members of a class-typed object.
I see I'm a day late (and maybe a dollar short), but I had the same problem in the early days and the following helped:
for the dot (.) as the selector for Class, I remembered it as: "My class always starts on the dot, not a minute early or late."
for the number sign (#) for ID, I just reminded myself that an ID(entification) card is incomplete without its number.
Spend lots of time writing CSS. When you've got it wrong enough times, your brain will give in and retain it.

Semicolon as URL query separator

Although it is strongly recommended (W3C source, via Wikipedia) for web servers to support semicolon as a separator of URL query items (in addition to ampersand), it does not seem to be generally followed.
For example, compare
        http://www.google.com/search?q=nemo&oe=utf-8
        http://www.google.com/search?q=nemo;oe=utf-8
results. (In the latter case, semicolon is, or was at the time of writing this text, treated as ordinary string character, as if the url was: http://www.google.com/search?q=nemo%3Boe=utf-8)
Although the first URL parsing library I tried behaves well:
>>> from urlparse import urlparse, parse_qs
>>> url = 'http://www.google.com/search?q=nemo;oe=utf-8'
>>> parse_qs(urlparse(url).query)
{'q': ['nemo'], 'oe': ['utf-8']}
What is the current status of accepting semicolon as a separator, and what are potential issues or some interesting notes? (from both server and client point of view)
The W3C Recommendation from 1999 is obsolete. The current status, according to the 2014 W3C Recommendation, is that semicolon is now illegal as a parameter separator:
To decode application/x-www-form-urlencoded payloads, the following algorithm should be used. [...] The output of this algorithm is a sorted list of name-value pairs. [...]
Let strings be the result of strictly splitting the string payload on U+0026 AMPERSAND characters (&).
In other words, ?foo=bar;baz means the parameter foo will have the value bar;baz; whereas ?foo=bar;baz=sna should result in foo being bar;baz=sna (although technically illegal since the second = should be escaped to %3D).
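The splitting rule quoted above can be sketched in a few lines of Python. This is only an illustration of "split strictly on &", not the full decoding algorithm from the Recommendation (it skips the sorting and character-encoding steps), and the function name split_on_ampersand is made up here.

```python
from urllib.parse import unquote_plus


def split_on_ampersand(query):
    """Split a query string on '&' only, then on the first '=' of each chunk."""
    pairs = []
    for chunk in query.split("&"):
        if not chunk:
            continue  # ignore empty chunks such as a trailing '&'
        name, _, value = chunk.partition("=")
        pairs.append((unquote_plus(name), unquote_plus(value)))
    return pairs


if __name__ == "__main__":
    print(split_on_ampersand("q=nemo&oe=utf-8"))
    print(split_on_ampersand("foo=bar;baz=sna"))
```

With this rule the semicolon is just another character in the value, so foo=bar;baz=sna yields a single pair whose value is bar;baz=sna.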
As long as your HTTP server, and your server-side application, accept semicolons as separators, you should be good to go. I cannot see any drawbacks. As you said, the W3C spec is on your side:
We recommend that HTTP server implementors, and in particular, CGI implementors support the use of ";" in place of "&" to save authors the trouble of escaping "&" characters in this manner.
I agree with Bob Aman. The W3C spec is designed to make it easier to use anchor hyperlinks with URLs that look like form GET requests (e.g., http://www.host.com/?x=1&y=2). In this context, the ampersand conflicts with the system for character entity references, which all start with an ampersand (e.g., &quot;). So W3C recommends that web servers allow a semicolon to be used as a field separator instead of an ampersand, to make it easier to write these URLs. But this solution requires that writers remember that the ampersand must be replaced by something, and that a ; is an equally valid field delimiter, even though web browsers universally use ampersands in the URL when submitting forms. That is arguably more difficult than remembering to replace the ampersand with an &amp; in these links, just as would be done elsewhere in the document.
To make matters worse, until all web servers allow semicolons as field delimiters, URL writers can only use this shortcut for some hosts, and must use &amp; for others. They will also have to change their code later if a given host stops allowing semicolon delimiters. This is certainly harder than simply using &amp;, which will work for every server forever. This in turn removes any incentive for web servers to allow semicolons as field separators. Why bother, when everyone is already changing the ampersand to &amp; instead of ;?
In short, HTML is a big mess (due to its leniency), and using semicolons helps to simplify this a LOT. I estimate that when I factor in the complications that I've found, using ampersands as a separator makes the whole process about three times as complicated as using semicolons for separators instead!
I'm a .NET programmer and, to my knowledge, .NET does not inherently allow ';' separators, so I wrote my own parsing and handling methods because I saw a tremendous value in using semicolons rather than the already problematic system of using ampersands as separators. Unfortunately, very respectable people (like @Bob Aman in another answer) do not see why semicolon usage is far superior and so much simpler than using ampersands. So I now share a few points to perhaps persuade other respectable developers who don't yet recognize the value of using semicolons instead:
Using a querystring like '?a=1&b=2' in an HTML page is improper (without HTML encoding it first), but most of the time it works. This however is only due to most browsers being tolerant, and that tolerance can lead to hard-to-find bugs when, for instance, the value of the key value pair gets posted in an HTML page URL without proper encoding (directly as '?a=1&b=2' in the HTML source). A QueryString like '?who=me+&+you' is problematic too.
We people can have biases and can disagree about our biases all day long, so recognizing our biases is very important. For instance, I agree that I just think separating with ';' looks 'cleaner'. I agree that my 'cleaner' opinion is purely a bias. And another developer can have an equally opposite and equally valid bias. So my bias on this one point is not any more correct than the opposite bias.
But the unbiased case for the semicolon making everyone's life easier in the long run cannot be correctly disputed when the whole picture is taken into account. In short, using semicolons does make life simpler for everyone, with one exception: a small hurdle of getting used to something new. That's all. It's always more difficult to make anything change. But the difficulty of making the change pales in comparison to the continued difficulty of continuing to use &.
Using ; as a QueryString separator makes it MUCH simpler. Ampersand separators are more than twice as difficult to code properly as semicolons would be. (I think) most implementations are not coded properly, so most implementations aren't twice as complicated. But then tracking down and fixing the bugs leads to lost productivity. Here, I point out the 2 separate encoding steps (steps 1 and 3 below) needed to properly encode a QueryString when & is the separator:
Step 1: URL encode both the keys and values of the querystring.
Step 2: Concatenate the keys and values like 'a=1&b=2' after they are URL encoded from step 1.
Step 3: Then HTML encode the whole QueryString in the HTML source of the page.
So special encoding must be done twice for proper (bug free) URL encoding, and not just that, but the encodings are two distinct, different encoding types. The first is a URL encoding and the second is an HTML encoding (for HTML source code). If any of these is incorrect, then I can find you a bug. But step 3 is different for XML. For XML, XML character entity encoding is needed instead (which is almost identical). My point is that the last encoding is dependent upon the context of the URL, whether that be in an HTML web page, or in XML documentation.
Now with the much simpler semicolon separators, the process is as one would expect:
1: URL encode the keys and values,
2: concatenate the values together. (With no encoding for step 3.)
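The two procedures can be compared side by side with a small Python sketch. It is only an illustration under my own assumptions: the function names build_query and href_value and the sample parameters are made up, and Python's quote_plus and html.escape stand in for whatever URL and HTML encoders your platform provides.

```python
from html import escape
from urllib.parse import quote_plus


def build_query(params, separator="&"):
    """Steps 1 and 2: URL-encode each key and value, then join the pairs."""
    return separator.join(
        f"{quote_plus(key)}={quote_plus(value)}" for key, value in params
    )


def href_value(params, separator="&"):
    """Step 3: HTML-encode the finished query for use inside an href attribute."""
    return escape("?" + build_query(params, separator))


if __name__ == "__main__":
    params = [("who", "me & you"), ("b", "2")]
    print(href_value(params, "&"))  # the separator itself needs HTML encoding
    print(href_value(params, ";"))  # HTML encoding leaves this query unchanged
```

For these sample values the HTML-encoding step changes the ampersand-separated query (the separator becomes &amp;) but leaves the semicolon-separated one untouched, which is the asymmetry described above.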
I think most web developers skip step 3 because browsers are so lenient. But this leads to bugs and more complications when hunting down those bugs or users not being able to do things if those bugs were not present, or writing bug reports, etc.
Another complication in real use is when writing XML documentation markup in my source code in both C# and VB.NET. Since & must be encoded, it's a real drag, literally, on my productivity. That extra step 3 makes it harder to read the source code too. So this harder-to-read deficit applies not only to HTML and XML, but also to other applications like C# and VB.NET code because their documentation uses XML documentation. So the step #3 encoding complication proliferates to other applications too.
So in summary, using the ; as a separator is simple because the (correct) process when using the semicolon is how one would normally expect the process to be: only one step of encoding needs to take place.
Perhaps this wasn't too confusing. But all the confusion or difficulty is due to using a separation character that should be HTML encoded. Thus '&' is the culprit. And semicolon relieves all that complication.
(I will point out that my 3 step vs 2 step process above is usually how many steps it would take for most applications. However, for completely robust code, all 3 steps are needed no matter which separator is used. But in my experience, most implementations are sloppy and not robust. So using semicolon as the querystring separator would make life easier for more people, with fewer website and interop bugs, if everyone adopted the semicolon as the default instead of the ampersand.)
