regex to find pattern not inside another pattern - css

I'm trying to write a regex to find all ID selectors in a CSS file. Basically, that means any word that starts with a #, so okay
#\w+
Except ... color specifiers can also start with a #. So what I really want is all words that start with a # that are NOT between { and }. I can't figure out how to say this.
I'm doing this in Notepad++ so I need that flavor of regex.
BTW my real objective is to delete everything that's not an ID selector from the file, so I end up with just a list of selectors. My first try was
Find: [^#]*(#\w+)
Replace: \1\r\n
... and then hit Replace All.
But then I ran into the color problem.
Update
Someone asks for an example. Ok:
Input:
.foo {max-width: 500px;}
#bar {text-align: left;}
.splunge, #plugh {color: #ff0088;}
Desired output:
#bar
#plugh
Note the point is that it includes the two "pound strings" that come outside of braces but not the one that comes inside braces.

What about this? You could use a lookahead expression:
#\w+(?=[^}]*?{)
It ensures that a { follows the match with no } in between, which indicates that the match is part of a selector. A # inside a declaration block reaches a } first, so color values are excluded.
#: match must begin with a #
\w+: match one or more word characters (might need tweaking: \w is equivalent to [A-Za-z0-9_], so an ID containing a hyphen would be cut short at the hyphen)
(?=...): positive lookahead
[^}]*?: zero or more characters other than }, as few as possible
{: the { character
https://regex101.com/r/Di43hX/3
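As a quick check outside Notepad++, the lookahead pattern can be run against the sample input from the question. This is only a sketch in Python, not the Notepad++ workflow itself; the { is escaped here to be safe, and the variable names are made up for illustration:

```python
import re

# Sample input from the question.
css = """.foo {max-width: 500px;}
#bar {text-align: left;}
.splunge, #plugh {color: #ff0088;}
"""

# '#' plus word characters, but only if a '{' is reached before any '}'.
ID_SELECTOR = re.compile(r"#\w+(?=[^}]*?\{)")

ids = ID_SELECTOR.findall(css)
print("\n".join(ids))
```

For this input it prints #bar and #plugh, one per line, and skips #ff0088.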

Related

What characters/symbols are allowed within the CSS class selectors?
I know that the following characters are invalid, but what characters are valid?
~ ! @ $ % ^ & * ( ) + = , . / ' ; : " ? > < [ ] \ { } | ` #
You can check directly at the CSS grammar.
Basically [1], a name must begin with an underscore (_), a hyphen (-), or a letter (a–z), followed by any number of hyphens, underscores, letters, or numbers. There is a catch: if the first character is a hyphen, the second character must [2] be a letter or underscore, and the name must be at least 2 characters long.
-?[_a-zA-Z]+[_a-zA-Z0-9-]*
In short, the previous rule translates to the following, extracted from the W3C specification:
In CSS, identifiers (including element names, classes, and IDs in
selectors) can contain only the characters [a-z0-9] and ISO 10646
characters U+00A0 and higher, plus the hyphen (-) and the underscore
(_); they cannot start with a digit, or a hyphen followed by a digit.
Identifiers can also contain escaped characters and any ISO 10646
character as a numeric code (see next item). For instance, the
identifier "B&W?" may be written as "B\&W\?" or "B\26 W\3F".
Identifiers beginning with a hyphen or underscore are typically reserved for browser-specific extensions, as in -moz-opacity.
[1] It's all made a bit more complicated by the inclusion of escaped Unicode characters (that no one really uses).
[2] Note that, according to the grammar I linked, a rule starting with two hyphens, e.g., --indent1, is invalid. However, I'm pretty sure I've seen this in practice.
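To see what the simplified pattern accepts and rejects, here is a small sketch in Python. The helper name is_simple_css_name is made up for illustration, and the pattern deliberately ignores escapes and non-ASCII characters, just like the one-line regex above:

```python
import re

# Simplified pattern from the answer; it ignores escapes and non-ASCII characters.
CSS_NAME = re.compile(r"-?[_a-zA-Z]+[_a-zA-Z0-9-]*")

def is_simple_css_name(name: str) -> bool:
    """Return True if the whole string matches the simplified pattern."""
    return CSS_NAME.fullmatch(name) is not None

for candidate in ["foo", "_private", "-moz-opacity", "9lives", "-9x", "--indent1"]:
    print(candidate, is_simple_css_name(candidate))
```

The first three candidates are accepted. The last three are rejected: a leading digit, a hyphen followed by a digit, and a double hyphen.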
To my surprise most answers here are wrong. It turns out that:
Any character except NUL is allowed in CSS class names. (If the CSS contains NUL, escaped or not, the result is undefined. [CSS-characters])
Mathias Bynens' answer links to explanation and demos showing how to use these names. Written down in CSS code, a class name may need escaping, but that doesn’t change the class name. E.g. an unnecessarily over-escaped representation will look different from other representations of that name, but it still refers to the same class name.
Most other (programming) languages don’t have that concept of escaping variable names (“identifiers”), so all representations of a variable have to look the same. This is not the case in CSS.
Note that in HTML there is no way to include space characters (space, tab, line feed, form feed and carriage return) in a class name attribute, because they already separate classes from each other.
So, if you need to turn a random string into a CSS class name: take care of NUL and space, and escape (accordingly for CSS or HTML). Done.
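A rough sketch of the "escape accordingly for CSS" step in Python. The function name css_escape is hypothetical, and the rules are loosely modeled on the "serialize an identifier" steps of the CSSOM specification as I understand them, so treat it as an illustration rather than a reference implementation:

```python
def css_escape(name: str) -> str:
    """Escape a string so it can be written as a CSS identifier (sketch)."""
    out = []
    for i, ch in enumerate(name):
        code = ord(ch)
        if code == 0:
            out.append("\ufffd")  # NUL is replaced, never emitted
        elif code < 0x20 or code == 0x7F:
            out.append(f"\\{code:x} ")  # control characters: hex escape
        elif ch.isascii() and ch.isdigit() and (i == 0 or (i == 1 and name[0] == "-")):
            out.append(f"\\{code:x} ")  # a leading digit must be hex-escaped
        elif ch == "-" and len(name) == 1:
            out.append("\\-")  # a lone hyphen
        elif code >= 0x80 or ch in "-_" or (ch.isascii() and ch.isalnum()):
            out.append(ch)  # safe to write as-is
        else:
            out.append("\\" + ch)  # everything else: backslash escape
    return "".join(out)

print(css_escape("test.123"))
print(css_escape("000000-8"))
```

The escaped form is only a different spelling in the stylesheet; the class name itself stays the same.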
I’ve answered your question in-depth at CSS character escape sequences. The article also explains how to escape any character in CSS (and JavaScript), and I made a handy tool for this as well. From that page:
If you were to give an element an ID value of ~!@$%^&*()_+-=,./';:"?><[]\{}|`#, the selector would look like this:
CSS:
<style>
#\~\!\@\$\%\^\&\*\(\)\_\+-\=\,\.\/\'\;\:\"\?\>\<\[\]\\\{\}\|\`\#
{
background: hotpink;
}
</style>
JavaScript:
<script>
// document.getElementById or similar
document.getElementById('~!@$%^&*()_+-=,./\';:"?><[]\\{}|`#');
// document.querySelector or similar
$('#\\~\\!\\@\\$\\%\\^\\&\\*\\(\\)\\_\\+-\\=\\,\\.\\/\\\'\\;\\:\\"\\?\\>\\<\\[\\]\\\\\\{\\}\\|\\`\\#');
</script>
Read the W3C spec. (this is CSS 2.1; find the appropriate version for your assumption of browsers)
relevant paragraph:
In CSS, identifiers (including
element names, classes, and IDs in
selectors) can contain only the
characters [a-z0-9] and ISO 10646
characters U+00A1 and higher, plus the
hyphen (-) and the underscore (_);
they cannot start with a digit, or a
hyphen followed by a digit.
Identifiers can also contain escaped
characters and any ISO 10646 character
as a numeric code (see next item). For
instance, the identifier "B&W?" may be
written as "B\&W\?" or "B\26 W\3F".
As @mipadi points out in Kenan Banks's answer, there's this caveat, also in the same webpage:
In CSS, identifiers may begin with '-'
(dash) or '_' (underscore). Keywords
and property names beginning with '-'
or '_' are reserved for
vendor-specific extensions. Such
vendor-specific extensions should have
one of the following formats:
'-' + vendor identifier + '-' + meaningful name
'_' + vendor identifier + '-' + meaningful name
Example(s):
For example, if XYZ organization added
a property to describe the color of
the border on the East side of the
display, they might call it
-xyz-border-east-color.
Other known examples:
-moz-box-sizing
-moz-border-radius
-wap-accesskey
An initial dash or underscore is
guaranteed never to be used in a
property or keyword by any current or
future level of CSS. Thus typical CSS
implementations may not recognize such
properties and may ignore them
according to the rules for handling
parsing errors. However, because the
initial dash or underscore is part of
the grammar, CSS 2.1 implementers
should always be able to use a
CSS-conforming parser, whether or not
they support any vendor-specific
extensions.
Authors should avoid vendor-specific
extensions
The complete regular expression is:
-?(?:[_a-z]|[\200-\377]|\\[0-9a-f]{1,6}(\r\n|[ \t\r\n\f])?|\\[^\r\n\f0-9a-f])(?:[_a-z0-9-]|[\200-\377]|\\[0-9a-f]{1,6}(\r\n|[ \t\r\n\f])?|\\[^\r\n\f0-9a-f])*
So all of your listed characters, except “-” and “_” are not allowed if used directly. But you can encode them using a backslash foo\~bar or using the Unicode notation foo\7E bar.
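To try the full expression against the two escaped forms mentioned above, here is a sketch in Python. It is a transcription under two assumptions of mine: the octal range [\200-\377] is written as [\x80-\xff] (so in a Python str it only covers U+0080 to U+00FF), and the match is case-insensitive. The constant names are made up for readability:

```python
import re

# Transcription of the expression above; [\200-\377] is written as [\x80-\xff].
ESCAPE = r"\\[0-9a-f]{1,6}(?:\r\n|[ \t\r\n\f])?|\\[^\r\n\f0-9a-f]"
NMSTART = rf"(?:[_a-z]|[\x80-\xff]|{ESCAPE})"
NMCHAR = rf"(?:[_a-z0-9-]|[\x80-\xff]|{ESCAPE})"
IDENT = re.compile(rf"-?{NMSTART}{NMCHAR}*", re.IGNORECASE)

for text in [r"foo\~bar", r"foo\7E bar", "foo~bar", "9foo"]:
    print(repr(text), IDENT.fullmatch(text) is not None)
```

Both escaped spellings of foo~bar are accepted, while the unescaped foo~bar and the digit-initial 9foo are not.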
For those looking for a workaround, you can use an attribute selector, for instance, if your class begins with a number. Change:
.000000-8{background:url(../../images/common/000000-0.8.png);} /* DOESN'T WORK!! */
to this:
[class="000000-8"]{background:url(../../images/common/000000-0.8.png);} /* WORKS :) */
Also, if the element has multiple classes, you will need to specify all of them in the selector or use the ~= operator:
[class~="000000-8"]{background:url(../../images/common/000000-0.8.png);}
Sources:
https://benfrain.com/when-and-where-you-can-use-numbers-in-id-and-class-names/
Is there a workaround to make CSS classes with names that start with numbers valid?
https://developer.mozilla.org/en-US/docs/Web/CSS/Attribute_selectors
My understanding is that the underscore is technically valid. Check out:
https://developer.mozilla.org/en/underscores_in_class_and_id_names
"...errata to the specification published in early 2001 made underscores legal for the first time."
The article linked above says never use them, then gives a list of browsers that don't support them, all of which are, in terms of numbers of users at least, long-redundant.
I would not recommend using anything except A-Z, a-z, 0-9, _ and -, because it is simply easier to code with those symbols. Also, do not start class names with -, since such names are usually browser-specific flags. Sticking to these characters avoids issues with IDE autocompletion and keeps things simpler if you ever need to generate class names from other code. Some transpiling software may also fail on unusual names.
Yet CSS is quite loose on this. You can use any symbol, and even emoji works.
<style>
.😭 {
border: 2px solid blue;
width: 100px;
height: 100px;
overflow: hidden;
}
</style>
<div class="😭">
😅😅😅😅😅😅😅😅😅😅😅😅😅😅😅😅😅😅😅😅
</div>
We can use all characters in a class name, even characters like # and ., as long as we escape them with a \ (backslash).
.test\.123 {
color: red;
}
.test\#123 {
color: blue;
}
.test\@123 {
color: green;
}
.test\<123 {
color: brown;
}
.test\`123 {
color: purple;
}
.test\~123 {
color: tomato;
}
<div class="test.123">test.123</div>
<div class="test#123">test#123</div>
<div class="test@123">test@123</div>
<div class="test<123">test<123</div>
<div class="test`123">test`123</div>
<div class="test~123">test~123</div>
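In HTML5, class and ID values can start with a digit. In a CSS selector, though, a leading digit still has to be escaped (for example, .\31 23 selects class="123").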
Going off of Kenan Banks's answer, you can use the following two regex matches to make a string valid:
[^a-z0-9A-Z_-]
This is a reverse match that selects anything that isn't a letter, number, dash or underscore for easy removal.
^-*[0-9]+
This matches zero or more dashes followed by one or more digits at the beginning of a string, also for easy removal.
How I use it in PHP:
// Make alphanumeric with dashes and underscores (removes all other characters)
$class = preg_replace("/[^a-z0-9A-Z_-]/", "", $class);
// Classes only begin with an underscore or letter
$class = preg_replace("/^-*[0-9]+/", "", $class);
// Make sure the string is two or more characters long
return 2 <= strlen($class) ? $class : '';
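For comparison, the same two substitutions can be sketched in Python. The function name sanitize_class is made up for illustration; the logic is meant to mirror the PHP snippet above, nothing more:

```python
import re

def sanitize_class(name: str) -> str:
    # Keep only letters, digits, dashes and underscores.
    name = re.sub(r"[^a-zA-Z0-9_-]", "", name)
    # Drop leading dashes followed by digits, so the name does not start with a digit.
    name = re.sub(r"^-*[0-9]+", "", name)
    # Require at least two characters, as in the PHP version.
    return name if len(name) >= 2 else ""

print(sanitize_class("123-foo bar!"))
print(sanitize_class("hello world"))
```

Note that this approach is lossy: different inputs can collapse to the same class name.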

How to select via regex all class notations in css file

Given a CSS file, I'd like to select only the class selectors (e.g. .tblGenFixed) but not the values in a rule (e.g. the .3 in opacity: 0.3).
This is my regex:
/(\.([\w_]+))/g
This is my alternative solution but it doesn't work
/(?!\{)(\.([\w_]+))(?!\})/g
I have set an example in regex101 here https://regex101.com/r/gG4nN4/1
How can I ignore css rule values ?
See this : Which characters are valid in CSS class names/selectors?
A value will have a digit after the dot. Luckily, valid CSS class names cannot start with a digit :)
Your regexp has to match a dot first, then a letter, - or _.
Caution: if you only look for whitespace before the dot, a value like .5 will still match.
Try this one: (\.([a-zA-Z_-][\w-]*))
Edit :
See this too : Regex to match a CSS class name
-?[_a-zA-Z]+[_a-zA-Z0-9-]*
Relevant quote :
Basically, a name must begin with an underscore (_), a hyphen (-), or a letter(a–z), followed by any number of hyphens, underscores, letters, or numbers. There is a catch: if the first character is a hyphen, the second character must be a letter or underscore, and the name must be at least 2 characters long.
Depending on how your CSS is written, you might be able to get what you are looking for by requiring a non-word character (such as whitespace) before the period:
/\W(\.([\w_]+))/g
Here's a fork of your regex.
Depending on what you are looking for, you might want to skip one of those capture groups:
\W\.([\w_]+)
I'd also warn against parsing CSS with a regex without manually examining the results.
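The ideas from these answers can be combined in a short Python sketch: require a valid first character after the dot, and additionally require that a { is reached before any }. The sample stylesheet below is hypothetical (it is not the one from the regex101 link), and the lookahead is an addition of mine, borrowed from the usual "not inside braces" trick:

```python
import re

# Hypothetical sample; not taken from the regex101 link.
css = """.tblGenFixed td.s0 {opacity: 0.3; background: url(a.png)}
.x-y, #main {margin: .5em}
"""

CLASS_SELECTOR = re.compile(
    r"\.(-?[_a-zA-Z][_a-zA-Z0-9-]*)"  # dot followed by a name that cannot start with a digit
    r"(?=[^{}]*\{)"                    # and a '{' is reached before any '}'
)

classes = CLASS_SELECTOR.findall(css)
print(classes)
```

The decimal values (0.3, .5em) fail the first-character check, and .png fails the lookahead because it sits inside a declaration block. Comments and quoted strings can still fool it, so the warning about checking results by hand applies here too.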

Regex, get only first occurrence and stop

I'm trying to grab each individual keyframes declaration in a css file, and copy it, but inserting moz/ms/o to handle each browser with keyframes.
I'm using this regex:
(@)(-webkit-)([\s\S]*)(\}\R\}\R@)
To try and capture each collection (see full example at my Rubular)
Try this:
/(@)(-webkit-)(.*?\R\})/m
In Ruby (which Rubular uses), the m modifier makes . match newlines as well, so the pattern can span several lines. I removed the match for @ at the end, because otherwise it can't match the last block in the file. And *? makes the match non-greedy, so it only matches one block at a time.
Rubular
The closest you get is...
(@-webkit-[^}]*}\s*to\s*{[^}]*}\s*})
...which can handle unusual/mangled indentation in your CSS files decently. This is how it works:
(                 Start a capture group...
@-webkit-         ...beginning with this phrase.
[^}]* }           Continue until you see a '}' character.
\s* to \s* {      Next, the word 'to', optionally surrounded by whitespace, followed by '{'...
[^}]* }           ...keep going until the next '}' character.
\s* }             A final '}' character, possibly preceded by whitespace.
)                 Stop capturing.
It might be that there are cases where you have a false positive since regex doesn't understand nesting.
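Python's re module has no \R, so here is a hedged sketch of the same "one block at a time" idea without it. It assumes each block starts with @-webkit-keyframes and contains only one level of inner braces; the sample stylesheet and the -moz- copy step are made up for illustration:

```python
import re

# Hypothetical stylesheet fragment with two keyframes blocks.
css = """@-webkit-keyframes spin {
  from { transform: rotate(0deg); }
  to { transform: rotate(360deg); }
}
@-webkit-keyframes fade {
  from { opacity: 0; }
  to { opacity: 1; }
}
"""

KEYFRAMES = re.compile(
    r"@-webkit-keyframes\s+[\w-]+\s*\{"  # at-rule header and opening brace
    r"(?:[^{}]*\{[^{}]*\})*"             # any number of inner "step { ... }" blocks
    r"\s*\}"                             # closing brace of the at-rule
)

blocks = KEYFRAMES.findall(css)
# Emit a copy of each block for another (assumed) vendor prefix.
moz_blocks = [b.replace("@-webkit-", "@-moz-", 1) for b in blocks]
print(len(blocks))
```

Because the inner blocks are matched explicitly, each match stops at the closing brace of its own at-rule instead of running on to the end of the file.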

How do I match individual CSS attributes using RegEx

I'm trying to expand a minified CSS file (don't ask) to make it human readable.
I've managed to get most of the expanding done but I'm stuck at a very weird case that I can't figure out.
I have CSS that looks like this:
.innerRight {
border:0;color:#000;width:auto;padding-top:0;margin:0;
}
a {
color:#000;text-decoration:underline;font-size:12px;
}
p,small,ul,li {
color:#000;font-size:12px;padding:0;
}
I've tried (.+):(.+); as the search and \t\1: \2;\n as the replace. The find RegEx is valid, the only problem is that it matches the entire line of attributes. I've tried the non-greedy character, but I must not be putting it in the right place.
What the above find RegEx matches is:
0: border:0;color:#000;width:auto;padding-top:0;margin:0;
1: color:#000;text-decoration:underline;font-size:12px;
2: color:#000;font-size:12px;padding:0;
While those are technically correct matches, I need it to match border:0;, color:#000;, etc separately for my replace to work.
Try this - use non-greedy matching. This works for me
(.+?):(.+?);
Forget the colon. Just replace all semicolons with ";\n".
In Javascript, for example, you could write:
text = text.replace(/;/gm,";\n");
I would further refine that to address leading-space issues, etc., but this will put every style rule on its own line.
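The non-greedy pattern from the first answer can be checked quickly in Python. This is a sketch on a single declaration line from the question; selectors containing a colon (such as a:hover) would need extra care:

```python
import re

line = "border:0;color:#000;width:auto;padding-top:0;margin:0;"

# Non-greedy groups stop at the first ':' and the first ';' respectively.
expanded = re.sub(r"(.+?):(.+?);", r"\t\1: \2;\n", line)
print(expanded)
```

Each property now lands on its own tab-indented line, for example "border: 0;" followed by "color: #000;".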

Trying to remove hex codes from regular expression results

My first question here at so!
To the point;
I'm pretty newbish when it comes to regular expressions.
To learn it a bit better and create something I can actually use, I'm trying to create a regexp that will find all the CSS tags in a CSS file.
So far, I'm using:
[#.]([a-zA-Z0-9_\-])*
Which is working pretty fine and finds the #TB_window as well as the #TB_window img#TB_Image and the .TB_Image#TB_window.
The problem is it also finds the hex code tags in the CSS file. ie #FFF or #eaeaea.
The .png, .jpg, and 0.75 are found as well.
Actually it's pretty logical that they are found, but aren't there smart workarounds for that?
Like excluding anything between the brackets {..}?
(I'm pretty sure that's possible, but my regexp experience is not much yet).
Thanks in advance!
Cheers!
Mike
As long as you stick to plain rule sets (no nested at-rules such as @media), CSS is simple enough to be parsed with regexes. All there is to it are groups of selectors, each followed by a group of declarations separated by semicolons.
Note that all regexes in this post should have the verbose and dotall flags set (/s and /x in some languages, re.DOTALL and re.VERBOSE in Python).
To get pairs of (selectors, rules):
\s* # Match any initial space
([^{}]+?) # Ungreedily match a string of characters that are not curly braces.
\s* # Arbitrary spacing again.
\{ # Opening brace.
\s* # Arbitrary spacing again.
(.*?) # Ungreedily match anything any number of times.
\s* # Arbitrary spacing again.
\} # Closing brace.
This will not work in the rare case of having a quoted curly bracket in an attribute selector (e.g. img[src~='{abc}']) or in a rule (e.g. background: url('images/ab{c}.jpg')). This can be fixed by complicating the regex some more:
\s* # Match any initial space
((?: # Start the selectors capture group.
[^{}\"\'] # Any character other than braces or quotes.
| # OR
\" # An opening double quote.
(?:[^\"\\]|\\.)* # Either a neither-quote-not-backslash, or an escaped character.
\" # And a closing double quote.
| # OR
\'(?:[^\'\\]|\\.)*\' # Same as above, but for single quotes.
)+?) # Ungreedily match all that once or more.
\s* # Arbitrary spacing again.
\{ # Opening brace.
\s* # Arbitrary spacing again.
((?:[^{}\"\']|\"(?:[^\"\\]|\\.)*\"|\'(?:[^\'\\]|\\.)*\')*?)
# The above line is the same as the one in the selector capture group.
\s* # Arbitrary spacing again.
\} # Closing brace.
# This will even correctly identify escaped quotes.
Woah, that's a handful. But if you approach it in a modular fashion, you'll notice it's not as complex as it seems at first glance.
Now, to split selectors and rules, we have to match strings of characters that are either non-delimiters (where a delimiter is the comma for selectors and a semicolon for rules) or quoted strings with anything inside. We'll use the same pattern we used above.
For selectors:
\s* # Match any initial space
((?: # Start the selectors capture group.
[^,\"\'] # Any character other than commas or quotes.
| # OR
\" # An opening double quote.
(?:[^\"\\]|\\.)* # Either a neither-quote-not-backslash, or an escaped character.
\" # And a closing double quote.
| # OR
\'(?:[^\'\\]|\\.)*\' # Same as above, but for single quotes.
)+?) # Ungreedily match all that.
\s* # Arbitrary spacing.
(?:,|$) # Followed by a comma or the end of a string.
For rules:
\s* # Match any initial space
((?: # Start the rules capture group.
[^;\"\'] # Any character other than semicolons or quotes.
| # OR
\" # An opening double quote.
(?:[^\"\\]|\\.)* # Either a neither-quote-not-backslash, or an escaped character.
\" # And a closing double quote.
| # OR
\'(?:[^\'\\]|\\.)*\' # Same as above, but for single quotes.
)+?) # Ungreedily match all that.
\s* # Arbitrary spacing.
(?:;|$) # Followed by a semicolon or the end of a string.
Finally, for each rule, we can split (once!) on a colon to get a property-value pair.
Putting that all together into a Python program (the regexes are the same as above, but non-verbose to save space):
import re
CSS_FILENAME = 'C:/Users/Max/frame.css'
RE_BLOCK = re.compile(r'\s*((?:[^{}"\'\\]|\"(?:[^"\\]|\\.)*"|\'(?:[^\'\\]|\\.)*\')+?)\s*\{\s*((?:[^{}"\'\\]|"(?:[^"\\]|\\.)*"|\'(?:[^\'\\]|\\.)*\')*?)\s*\}', re.DOTALL)
RE_SELECTOR = re.compile(r'\s*((?:[^,"\'\\]|\"(?:[^"\\]|\\.)*\"|\'(?:[^\'\\]|\\.)*\')+?)\s*(?:,|$)', re.DOTALL)
RE_RULE = re.compile(r'\s*((?:[^;"\'\\]|\"(?:[^"\\]|\\.)*\"|\'(?:[^\'\\]|\\.)*\')+?)\s*(?:;|$)', re.DOTALL)
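with open(CSS_FILENAME) as f:
    css = f.read()
# For each block: the list of selectors, then a [property, value] pair per rule.
print([(RE_SELECTOR.findall(i),
        [re.split(r'\s*:\s*', k, maxsplit=1)
         for k in RE_RULE.findall(j)])
       for i, j in RE_BLOCK.findall(css)])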
For this sample CSS:
body, p#abc, #cde, a img .fgh, * {
font-size: normal; background-color: white !important;
-webkit-box-shadow: none
}
#test[src~='{a\'bc}'], .tester {
-webkit-transition: opacity 0.35s linear;
background: white !important url("abc\"cd'{e}.jpg");
border-radius: 20px;
opacity: 0;
-webkit-box-shadow: rgba(0, 0, 0, 0.6) 0px 0px 18px;
}
span {display: block;} .nothing{}
... we get (spaced for clarity):
[(['body',
'p#abc',
'#cde',
'a img .fgh',
'*'],
[['font-size', 'normal'],
['background-color', 'white !important'],
['-webkit-box-shadow', 'none']]),
(["#test[src~='{a\\'bc}']",
'.tester'],
[['-webkit-transition', 'opacity 0.35s linear'],
['background', 'white !important url("abc\\"cd\'{e}.jpg")'],
['border-radius', '20px'],
['opacity', '0'],
['-webkit-box-shadow', 'rgba(0, 0, 0, 0.6) 0px 0px 18px']]),
(['span'],
[['display', 'block']]),
(['.nothing'],
[])]
Simple exercise for the reader: write a regex to remove CSS comments (/* ... */).
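One possible answer to that exercise, as a minimal Python sketch. The function name is made up, and it assumes that /* never appears inside a quoted string or url(), which a plain regex cannot tell apart from a real comment:

```python
import re

def strip_css_comments(css: str) -> str:
    # DOTALL lets '.' cross newlines; '*?' stops at the first closing '*/'.
    return re.sub(r"/\*.*?\*/", "", css, flags=re.DOTALL)

print(strip_css_comments("a{/* x */color:red}/* multi\nline */b{}"))
```

Running this before the block regexes keeps braces inside comments from being mistaken for rule boundaries.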
What about this:
([#.]\S+\s*,?)+(?=\{)
First off, I don't see how the RE you posted would find .TB_Image#TB_window. You could do something like:
/^[#\.]([a-zA-Z0-9_\-]*)\s*{?\s*$/
This would find any occurrences of # or . at the beginning of the line, followed by the tag, optionally followed by a { and then a newline.
Note that this would NOT work for lines like .TB_Image { something: 0; } (all on one line) or div.mydivclass since the . is not at the beginning of the line.
Edit: I don't think nested braces are allowed in CSS, so if you read in all the data and get rid of newlines, you could do something like:
/([a-zA-Z0-9_\-]*([#\.][a-zA-Z0-9_\-]+)+\s*,?\s*)+{.*}/
There's a way to tell a regex to ignore newlines as well, but I never seem to get that right.
It's actually not an easy task to solve with regular expressions since there are a lot of possibilities, consider:
descendant selectors like #someid ul img -- those are all valid tags and are separated by spaces
tags that don't start with . or # (i.e. HTML tag names) -- you have to provide a list of those in order to match them since they have no other difference from attributes
comments
more that I can't think of right now
I think you should instead consider some CSS parsing library suitable for your preferred language.
