I would like to search thousands of HTML files for bad CSS practices involving height, width, and similar properties.
For instance, I would like to find every place where height is given without units: height:40 should be found, but height:40px shouldn't.
For that I am using the search program Agent Ransack, which lets me search within files using a regular expression.
Currently my regular expression is:
(height:)[\s]*[0-9]*\.?[0-9]+(px)
This finds everything that looks like height:40px. (Later on I want to add width and other properties.)
My question is: how do I negate all of that?
Or is there another good application for searching files with regular expressions?
Use a negative lookahead (?!regex), e.g.:
height:\s*\d+(?:\.\d+)?(?!px|\d)
The \d in the lookahead is needed so that the regex cannot backtrack to a shorter number and match anyway (for example, matching height:4 inside height:40px).
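As a rough check of that lookahead, here is a minimal sketch using Python's re module (an assumption; Agent Ransack's regex flavor may differ in details). It also shows one edge case: with a decimal such as 40.5px, the optional fraction can be skipped during backtracking, so the pattern above still reports height:40. The NO_PX_STRICT variant is a hypothetical adjustment, not part of the original answer.

```python
import re

# Pattern from the answer above: a number after "height:" not followed by "px".
NO_PX = re.compile(r"height:\s*\d+(?:\.\d+)?(?!px|\d)")

# Hypothetical variant: also refuse to stop right before ".<digit>",
# so 40.5px cannot be matched as a bare "40".
NO_PX_STRICT = re.compile(r"height:\s*\d+(?:\.\d+)?(?!px|\d|\.\d)")


def flagged(pattern, css):
    """Return True when the pattern reports a unitless height in css."""
    return pattern.search(css) is not None


for css in ["height:40", "height: 40px", "height:40.5px", "height:12.5"]:
    print(css, flagged(NO_PX, css), flagged(NO_PX_STRICT, css))
```

Only the height:40.5px sample differs between the two patterns: the original flags it, the stricter variant does not.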
Consider using a program that already does this, and add your own rules. In general, the practice of checking source code for problems like these is called 'linting'. A quick search turns up tools such as CSSLint, which is open source and allows custom rules:
https://github.com/stubbornella/csslint/wiki/Rules
http://csslint.net/
Use the negative lookahead this way:
(?!height:\s*\d+(?:\.\d+)?px)height:\s*\d+(?:\.\d+)?
Since this is CSS, you may also want to treat other units as valid, such as pt, em, and %, as in the regex below:
(?!height:\s*\d+(?:\.\d+)?(?:px|pt|em|%|cm|mm|in|ex|pc))height:\s*\d+(?:\.\d+)?
You can test it on Rubular.
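To extend this toward width and other properties, as the question plans, a sketch along the following lines may help. It is written for Python's re module rather than Agent Ransack, the function name find_unitless and the sample markup are made up for illustration, and the lookbehind that skips line-height (and also min-height and similar) is a simplifying assumption.

```python
import re

# Units treated as valid, taken from the regex above.
UNITS = r"(?:px|pt|em|%|cm|mm|in|ex|pc)"

# (?<![\w-]) keeps "line-height" from matching as "height"; it also skips
# "min-height" and friends, which is a simplification of this sketch.
UNITLESS = re.compile(
    r"(?<![\w-])(?:height|width):\s*\d+(?:\.\d+)?(?![\d.]|" + UNITS + ")"
)


def find_unitless(text):
    """Return (line_number, matched_text) for each unitless height/width."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in UNITLESS.finditer(line):
            hits.append((lineno, match.group(0)))
    return hits


sample = """<div style="height:40; width: 100px">
<td style="width:12.5;height:2em">
<p style="line-height:1.4">"""

print(find_unitless(sample))  # [(1, 'height:40'), (2, 'width:12.5')]
```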
I'm looking to write a regular expression to validate a potential web address.
In 'http://www.microsoft.com', for example, I would like to make the 'http://' optional, so that if only 'www.microsoft.com' were entered into my textbox, it would still work.
I've done some research on regular expressions and my question specifically, but I'm not getting anywhere with finding one or really understanding how to write one.
I already have the regex provided in VS to validate an internet address; I'm just unsure how to modify it to make parts optional.
Regular Expressions are kind of difficult (in my opinion). If you want to use Regex, more power to you.
You could use something simple like this, too.
If links.StartsWith("https://") OrElse links.StartsWith("http://") OrElse links.StartsWith("www.") Then
    ' link is valid
End If
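If a regex is still preferred, making a part optional comes down to wrapping it in a group and following the group with ?. The sketch below uses Python and a deliberately loose pattern of my own, not the one Visual Studio provides, so treat both the pattern and the helper name looks_like_url as assumptions.

```python
import re

# (?:https?://)? makes the scheme optional: the group holds "http://" or
# "https://", and the trailing "?" allows it to be absent.
URL = re.compile(r"^(?:https?://)?(?:www\.)?[\w-]+(?:\.[\w-]+)+(?:/\S*)?$")


def looks_like_url(text):
    """Loose check: optional scheme, then a dotted host, then an optional path."""
    return URL.match(text) is not None


for candidate in ["http://www.microsoft.com", "www.microsoft.com", "microsoft"]:
    print(candidate, looks_like_url(candidate))
```

The same trick applies to any part of an existing pattern: group it with (?:...) and append ?.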
I'm just learning how to program in Ruby using the Nokogiri gem.
doc.xpath("//*[@class='someclass']//@href")
will return all href values under the "someclass" class, wherever it appears in the HTML.
doc.xpath("//*[@class='someclass']").xpath("//@href")
will return all href values in the entire HTML.
Could someone explain how to use an equivalent of //@ in XPath within previously parsed data, so that something like:
doc.xpath("//*[@class='someclass']").xpath(grab all the href within previously parsed)
is possible?
Using * and @ seems to be quite powerful, but I can't seem to narrow the search down: wherever I use them, they search through the entire HTML.
as a beginner, I just thought it would be.. intuitive? to be able to use "grab from everywhere" type of syntax limited to what has been parsed previously to narrow down my target, so I can do something like
xpath(whatever).css(whatever).xpath(whatever)
maybe this is not a good practice? maybe with more understanding of parsing concept I would never have to do this? sometimes I find using both xpath and CSS easier..
hopefully someone can enlighten me..
Try changing your second expression from
doc.xpath("//*[@class='someclass']").xpath("//@href")
to
doc.xpath("//*[@class='someclass']").xpath(".//@href")
// at the beginning of an XPath expression means "descendants of the root of the document," whereas .// means "descendants of the context node(s)."
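The same context-relative idea can be sketched outside Nokogiri. The example below uses Python's xml.etree.ElementTree, which supports only a subset of XPath (it cannot select attribute nodes with @href directly, so the sketch selects elements that carry an href and reads the attribute). The markup and the helper name hrefs_under are invented for illustration.

```python
import xml.etree.ElementTree as ET

html = """<html><body>
<div class="someclass"><a href="/inside-1">a</a><p><a href="/inside-2">b</a></p></div>
<div class="other"><a href="/outside">c</a></div>
</body></html>"""

root = ET.fromstring(html)


def hrefs_under(node):
    # ".//" searches the descendants of `node` only, not the whole document.
    return [el.get("href") for el in node.findall(".//*[@href]")]


# Narrow to the "someclass" elements first, then search relative to each one.
scoped = [
    href
    for box in root.findall(".//*[@class='someclass']")
    for href in hrefs_under(box)
]

print(scoped)             # ['/inside-1', '/inside-2']
print(hrefs_under(root))  # ['/inside-1', '/inside-2', '/outside']
```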
You're right that XPath is powerful, and some major aspects of it are intuitive... but there are significant pieces that aren't intuitive, or depend on how your intuition is trained. Careful study reaps dividends, especially if you are going to use XPath much!
What I want to do is write a selector that matches an arbitrary value in one place, then later requires a different value be equal to it. If [attr="value"] parsed "value" as a regex, then this would solve my problem:
*[class="(.+)"] *[class="if_\1"] {/* styles */}
Obviously I could just list each possible class individually, but that takes a great deal of convenience out of it.
Is this possible?
No, it's not possible. Attribute selectors are almost completely static and provide almost no dynamic matching functionality (beyond that of substring matches, which are still static and not dynamic pattern-based). They do not support anything like what you see in everyday regular expressions.
A stylesheet preprocessor such as Sass or LESS will allow you to generate the static CSS rules needed, but all that does is automate the manual task of listing all possible values individually, which proves my first point.
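What a preprocessor loop does here can be sketched in a few lines of Python, used purely for illustration instead of Sass or LESS. The function name paired_rules and the class names are hypothetical; the output is ordinary static CSS with one rule per listed value.

```python
def paired_rules(names, declarations):
    """Emit one static rule per name, pairing class X with descendant class if_X."""
    return "\n".join(
        f'*[class="{name}"] *[class="if_{name}"] {{ {declarations} }}'
        for name in names
    )


print(paired_rules(["red", "blue"], "display: block;"))
```

Every value still has to appear in the list passed in, which is the limitation the answer describes.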
I have a multiline textbox (textarea) that I want to verify has a particular string in it. I was trying:
<asp:RegularExpressionValidator runat="server" ControlToValidate="txtTemplate" ValidationExpression="^(.\s*)*Content(.\s*)*$" Text="content" ErrorMessage="Must contain: Content" />
Using ^(.\s*)*$ seems to pass for a textarea. So I tried to sandwich my criteria between two of these. But it seems to lock up both IE and Chrome.
This should be simple, I think I'm making it tougher than it needs to be.
If the validation is always being done on the server (that's what runat="server" means, isn't it?), the simplest solution is probably to use this regex:
(?s)^.*Content.*$
(?s) turns on Singleline mode, which allows the . metacharacter to match all characters including linefeeds. If you want it to run on the client as well, use this:
^[\s\S]*Content[\s\S]*$
That's because JavaScript has traditionally had no equivalent for Singleline mode (also known as DOT_ALL, DOTALL, dot-matches-all, single-line, or /s mode); the s flag only arrived with ES2018, so older browsers lack it. It doesn't recognize leading inline modifiers like (?s) and (?i), either.
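The difference between the three forms can be seen in a small sketch. It uses Python's re module, which, like .NET, accepts a leading (?s); the sample text is made up.

```python
import re

text = "first line\nContent here\nlast line"

# Without Singleline mode, "." stops at the first newline, so the match fails.
plain = re.search(r"^.*Content.*$", text)

# (?s) lets "." match newlines as well.
inline = re.search(r"(?s)^.*Content.*$", text)

# [\s\S] matches any character without relying on a mode flag.
portable = re.search(r"^[\s\S]*Content[\s\S]*$", text)

print(plain is None)              # True
print(inline.group(0) == text)    # True
print(portable.group(0) == text)  # True
```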
Watch out for constructs like (.\s*)*, where an expression with quantifiers (*, +, etc.) is enclosed in a group which is itself controlled by a quantifier. If the regex fails to achieve a match right away, it goes back and tries to match by different paths (i.e., by using different parts of the regex to match different parts of the string), which can get very expensive, performance-wise. This regex is especially bad because . and \s can match many of the same characters, which dramatically increases the number of paths it has to explore before giving up.
The phenomenon is commonly known as catastrophic backtracking, and it usually manifests in cases where there's no possibility of a match. I would expect your validator to work fine when the sequence Content is present.
By the way, if you want to match only on the complete word Content, you should add word boundaries, like so:
(?s)^.*\bContent\b.*$
That will prevent false positives on words like MalContent and Contentious. \b works differently in different regex flavors. In .NET it's Unicode-aware unless you specify ECMAScript mode. In JavaScript it's supposed to recognize only the ASCII letters and digits as word characters; in most browsers it does, but don't take it for granted.
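A quick sketch of the word-boundary behavior, again in Python as an assumption (its \b is Unicode-aware for str patterns, similar to the .NET default described above):

```python
import re

WORD = re.compile(r"\bContent\b")


def has_word(text):
    """Return True only when Content appears as a complete word."""
    return WORD.search(text) is not None


for sample in ["Some Content here", "MalContent", "Contentious", "Content."]:
    print(sample, has_word(sample))
```

Only the first and last samples match, since in the other two Content is glued to neighboring letters.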
Try
[\S\s]*Content[\S\s]*
I think a regex more like .*Content.* would be more effective and possibly faster. Also, you may want to implement a custom validator if this continues to be a performance drag, where you use JavaScript to search the text for the string.
I'm implementing a simple search on a website, and right now I'm working on sanitizing the input. My plan is to make a whitelist of allowed characters. I'm using PHP, and so far I've got the current regex:
preg_replace('/[^a-z0-9 -]/i', '', $s);
So, I'm removing anything that's not alphanumeric or a space or a hyphen.
Is there a generally accepted whitelist for this sort of thing, or does it just depend on the application? I'm going to be searching on book titles, author names and book blurbs.
What about 2010 (A space odyssey)? What about Giscard d'Estaing's autobiography? ... This is really impossible to answer generally; it will depend on your application and data structures.
You want to look into the fulltext search functions of the database of your choice, or even specialized search appliances like Sphinx.
Clarify what engine you will use first to actually perform your search, and the rules on what you need to strip out will become much clearer.
Google has some pretty advanced rules for searches, but their basic rule is this:
Generally, punctuation is ignored, including @#$%^&*()=+[]\ and other special characters.
However, Google makes exceptions for common search terms, like C++, C#, or $100.
If you want a search as sophisticated as Google's, you can make rules against the above punctuation and have some exceptions. However, for a simple search, just ignore the characters that Google generally ignores.
There's not a generic regular expression to solve this problem. Your code strips out a lot of things you might want to keep, like commas, exclamation points, (semi-)colons, and non-English letters. If you have a full list of all of the titles in your database, you should be able to write a script that will construct a list of all characters found in all of your titles. If your regular expression strips out any of those characters, then you risk having problems (although passing this test doesn't mean that you won't run into problems).
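The character-inventory script suggested above might look roughly like this. It is sketched in Python rather than PHP, with the whitelist from the question as the default; the titles are made-up sample data, and the function name stripped_characters is hypothetical.

```python
import re
from collections import Counter


def stripped_characters(titles, whitelist=r"[^a-z0-9 -]"):
    """Count the characters in titles that the whitelist regex would delete."""
    pattern = re.compile(whitelist, re.IGNORECASE)
    removed = Counter()
    for title in titles:
        removed.update(pattern.findall(title))
    return removed


titles = ["2010 (A Space Odyssey)", "Les Misérables", "Catch-22"]
print(stripped_characters(titles))
```

Any character that shows up in the result is one the current regex would silently drop from real titles.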
Depending on how the rest of your search is implemented, you may be able to strip out valid characters and still return relevant search results. In this case, you would want your expression to allow non-English characters (since you don't want to split a word) but you might be able to remove all punctuation marks that aren't inside of a quote-delimited phrase. For example, searching for red haired should give you all of the results you would get from searching for red-haired plus a few extra.
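One way to read that suggestion is sketched below, in Python for illustration. The function name normalize is hypothetical, and the sketch does not handle quote-delimited phrases. It keeps letters from any language and turns punctuation into spaces instead of deleting it.

```python
import re


def normalize(query):
    """Lowercase the query and replace punctuation with spaces."""
    # \w is Unicode-aware for str patterns in Python 3, so accented letters
    # survive. Punctuation (and "_") becomes a space, so words are not fused.
    words = re.sub(r"[^\w\s]|_", " ", query)
    return " ".join(words.lower().split())


print(normalize("Red-Haired!"))     # red haired
print(normalize("Les Misérables"))  # les misérables
```

Because the hyphen becomes a space, red-haired and red haired normalize to the same query.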