ASPX attribute regex parsing in c# - asp.net

I need to find attribute values in an ASPX file using regular expressions.
That means you don't need to worry about malformed HTML or any HTML related issues.
I need to find the value of a particular attribute (LocText). I want to get what's inside the quotes.
Any ASPX tags such as <%=, <%#, <%$ etc. inside the value don't make sense for this attribute therefore are considered as part of it.
The regex I began with looks like this:
LocText="([^"]+)"
This works well: the first group, which is the result text, captures everything except double quotes, which are not allowed there (&quot; must be used instead).
But ASPX also allows single quotes around attribute values, so a second regular expression is needed:
LocText='([^']+)'
I could use these two regular expressions but I'm looking for a way to connect them.
LocText=("([^"]+)"|'([^']+)')
This also works but doesn't seem very efficient, as it creates an unnecessary number of groups. I think this could be done with backreferences, but I can't get it to work.
LocText=(["']{1})([^\1]+)\1
My idea was to capture the single or double quote in the first group and then match anything that is NOT the character found in that group, closed again by the same quote from the first group. Obviously I'm wrong, and it doesn't work like that.
Is there any way to combine the first two expressions while creating a minimum number of groups, with one group holding the attribute value I want? Is it possible to use a backreference for the single/double quote, or have I completely misunderstood what backreferences mean?

I'd say your solution with alternation isn't that bad, but you could use named captures so the result will always be found in the same group's value:
Regex regexObj = new Regex(@"LocText=(?:""(?<attr>[^""]+)""|'(?<attr>[^']+)')");
resultString = regexObj.Match(subjectString).Groups["attr"].Value;
Explanation:
LocText= # Match LocText=
(?: # Either match
"(?<attr>[^"]+)" # "...", capture in named group <attr>
| # or match
'(?<attr>[^']+)' # '...', also capture in named group <attr>
) # End of alternation
Another option would be to use lookahead assertions ([^\1] isn't working because you can't place backreferences inside a character class, but you can use them in lookarounds):
Regex regexObj = new Regex(@"LocText=([""'])((?:(?!\1).)*)\1");
resultString = regexObj.Match(subjectString).Groups[2].Value;
Explanation:
LocText= # Match LocText=
(["']) # Match and capture (group 1) " or '
( # Match and capture (group 2)...
(?: # Try to match...
(?!\1) # (unless it's the quote character we matched before)
. # any character
)* # repeat any number of times
) # End of capturing group 2
\1 # Match the previous quote character
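For readers who want to try the backreference-plus-lookahead pattern outside .NET, here is a minimal sketch using Python's re module, which accepts the same syntax for this particular pattern. The helper name extract_loctext and the sample markup are made up for illustration.

```python
import re

# Group 1 captures the opening quote; group 2 captures every character
# that is not that same quote; \1 then requires the matching closing quote.
LOCTEXT = re.compile(r"""LocText=(["'])((?:(?!\1).)*)\1""")

def extract_loctext(markup):
    """Return all LocText attribute values found in markup (hypothetical helper)."""
    return [m.group(2) for m in LOCTEXT.finditer(markup)]

# Hypothetical ASPX fragment with one double-quoted and one single-quoted value.
sample = """<asp:Label LocText="It's here" /> <asp:Label LocText='Say "hi"' />"""
print(extract_loctext(sample))
```

The value always ends up in group 2 regardless of which quote style was used, and the other quote character is allowed inside the value.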

Related

Extract mm/dd/yyyy and m/dd/yyyy dates from string in R [duplicate]

My regex pattern looks something like
<xxxx location="file path/level1/level2" xxxx some="xxx">
I am only interested in the part in quotes assigned to location. Shouldn't it be as easy as below without the greedy switch?
/.*location="(.*)".*/
Does not seem to work.
You need to make your regular expression lazy/non-greedy, because by default, "(.*)" will match all of "file path/level1/level2" xxxx some="xxx".
Instead you can make your dot-star non-greedy, which will make it match as few characters as possible:
/location="(.*?)"/
Adding a ? on a quantifier (?, * or +) makes it non-greedy.
Note: this is only available in regex engines which implement the Perl 5 extensions (Java, Ruby, Python, etc) but not in "traditional" regex engines (including Awk, sed, grep without -P, etc.).
location="(.*)" will match from the " after location= until the " after some="xxx unless you make it non-greedy.
So you either need .*? (i.e. make it non-greedy by adding ?) or better replace .* with [^"]*.
[^"] Matches any character except for a " <quotation-mark>
More generic: [^abc] - Matches any character except for an a, b or c
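The difference between the three variants can be checked quickly. The following is a small sketch in Python (chosen only for illustration; the sample tag is the one from the question):

```python
import re

tag = '<xxxx location="file path/level1/level2" xxxx some="xxx">'

# Greedy: .* runs to the LAST double quote on the line.
greedy = re.search(r'location="(.*)"', tag).group(1)
# Lazy: .*? stops at the first double quote that lets the match succeed.
lazy = re.search(r'location="(.*?)"', tag).group(1)
# Negated class: [^"]* can never cross a double quote at all.
negated = re.search(r'location="([^"]*)"', tag).group(1)

print(greedy)   # file path/level1/level2" xxxx some="xxx
print(lazy)     # file path/level1/level2
print(negated)  # file path/level1/level2
```

The negated class has the added benefit of working in engines that have no lazy quantifiers.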
How about
.*location="([^"]*)".*
This avoids the unlimited search with .* and will match exactly to the first quote.
Use non-greedy matching, if your engine supports it. Add the ? inside the capture.
/location="(.*?)"/
Using a lazy quantifier (?) with no global flag is the answer.
With the global flag /g, it would instead return every shortest match in the input.
Here's another way.
Here's the one you want. This is lazy [\s\S]*?
The first item:
[\s\S]*?(?:location="([^"]*)")[\s\S]*
Replace with: $1
Explanation: https://regex101.com/r/ZcqcUm/2
For completeness, this gets the last one. This is greedy [\s\S]*
The last item:
[\s\S]*(?:location="([^"]*)")[\s\S]*
Replace with: $1
Explanation: https://regex101.com/r/LXSPDp/3
There's only 1 difference between these two regular expressions and that is the ?
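As a rough check of the first-versus-last behaviour, here is a sketch in Python, where the replacement $1 is written \1. The input string is invented for the example.

```python
import re

# Hypothetical input containing two location attributes.
text = '<a location="first/path"> <b location="second/path">'

# Lazy prefix: the earliest location= is found, the rest is swallowed.
first = re.sub(r'[\s\S]*?(?:location="([^"]*)")[\s\S]*', r'\1', text)
# Greedy prefix: the engine backtracks from the end, so the last one wins.
last = re.sub(r'[\s\S]*(?:location="([^"]*)")[\s\S]*', r'\1', text)

print(first)  # first/path
print(last)   # second/path
```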
The other answers here fail to spell out a full solution for regex versions which don't support non-greedy matching. The non-greedy quantifiers (.*?, .+? etc) are a Perl 5 extension which isn't supported in traditional regular expressions.
If your stopping condition is a single character, the solution is easy; instead of
a(.*?)b
you can match
a[^ab]*b
i.e. specify a character class which excludes the starting and ending delimiters.
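A quick illustration of the single-character case, sketched in Python purely to compare results (Python does support lazy quantifiers, so both forms can be run side by side; the input is made up):

```python
import re

text = "xxaYYYbZZb"

lazy = re.search(r"a(.*?)b", text).group(0)      # needs lazy-quantifier support
negated = re.search(r"a[^ab]*b", text).group(0)  # works in traditional engines too
greedy = re.search(r"a(.*)b", text).group(0)     # runs on to the last b

print(lazy)     # aYYYb
print(negated)  # aYYYb
print(greedy)   # aYYYbZZb
```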
In the more general case, you can painstakingly construct an expression like
start(|[^e]|e(|[^n]|n(|[^d])))*end
to capture a match between start and the first occurrence of end. Notice how the subexpression with nested parentheses spells out a number of alternatives which between them allow e only if it isn't followed by nd and so forth, and also take care to cover the empty string as one alternative which doesn't match whatever is disallowed at that particular point.
Of course, the correct approach in most cases is to use a proper parser for the format you are trying to parse, but sometimes, maybe one isn't available, or maybe the specialized tool you are using is insisting on a regular expression and nothing else.
Because you are using a quantified subpattern and, as described in the Perl documentation,
By default, a quantified subpattern is "greedy", that is, it will
match as many times as possible (given a particular starting location)
while still allowing the rest of the pattern to match. If you want it
to match the minimum number of times possible, follow the quantifier
with a "?" . Note that the meanings don't change, just the
"greediness":
*? //Match 0 or more times, not greedily (minimum matches)
+? //Match 1 or more times, not greedily
Thus, to allow your quantified pattern to make minimum match, follow it by ? :
/location="(.*?)"/
import regex  # third-party module; the standard-library re module also works here
text = 'ask her to call Mary back when she comes back'
p = r'(?i)(?s)call(.*?)back'
for match in regex.finditer(p, str(text)):
    print(match.group(1))
Output:
Mary

can someone explain this regular expression inside gsub()? [duplicate]

I'm trying to understand a regular expression someone has written in the gsub() function.
I had never used regular expressions before seeing this code. I have tried to work out how it gets the final result with some googling, but I have hit a wall.
gsub('.*(.{2}$)', '\\1',"my big fluffy cat")
This code returns the last two characters in the given string. In the above example it would return "at". This is the expected result, but from my brief foray into regular expressions I don't understand why this code does what it does.
What I understand is that '.*' means look for any character 0 or more times. So it's going to look at the entire string, and this is what will be replaced.
The part in brackets looks for any two characters at the end of the string. It would make more sense to me if this part in brackets were in place of the '\1'. To me it would then read: look at the entire string and replace it with the last two characters of that string.
All that does, though, is output the actual code as the replacement, e.g. ".{2}$".
Finally, I don't understand why '\1' is in the replacement part of the function. To me this is just saying replace the entire string with a single backslash and the number one. I say a single backslash because it's my understanding that the first backslash is just there to make the second backslash a non-special character.
There are two ways of using gsub. The most common is probably:
gsub("-","TEST","This is a - ")
which would return
This is a TEST
This simply finds the matches of the regular expression and replaces each with the replacement string.
The second way to use gsub is the one you described, using \\1, \\2 or \\3...
These refer to the first, second or third capture group in your regular expression.
A capture group is defined by anything inside parentheses, e.g. (capture_group_1)(capture_group_2)...
Explanation
Your analysis is correct.
What I understand is that '.*' means look for any character 0 or more times. So it's going to look at the entire string, and this is what will be replaced.
The part in brackets looks for any two characters at the end of the string
The last two characters are placed in a capture group, and we simply replace the whole string with this capture group rather than with any new text.
If it helps, check out the result of this expression:
gsub('(.*)(.{2}$)', 'Group 1: \\1, Group 2: \\2',"my big fluffy cat")
Hopefully these examples help you understand it better:
Say we have a string foobarabcabcdef
.* matches whole string.
.*abc matches from the beginning through any characters up to the last abc (greedy matching), thus it matches foobarabcabc
.*(...)$ matches the whole string as well, but the last 3 characters are grouped. Without the (), the match has only a default group, group 0; each () adds group 1, 2, 3, and so on. Consider .*(...)(...)(...)$, which gives:
group 0 : whole string
group 1 : "abc" the first "abc"
group 2 : "abc" the 2nd "abc"
group 3 : "def" the last 3 chars
So back to your example: \\1 is a reference to group 1. It means "replace the whole match with the text matched by group 1". That is, the text matched by .{2}$ is the replacement.
If you don't understand the backslashes, consult the R syntax for string literals. It is all about escaping.
The important part of that regular expression is the brackets, which form what is called a "capturing group".
The regular expression .*(.{2}$) says: match anything, and capture the last 2 characters of the line. The replacement \\1 refers to that group, so it replaces the whole match with the captured group, which is the last two characters in this case.
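The same behaviour can be reproduced outside R. Below is a minimal sketch using Python's re.sub as a stand-in for gsub (an analogy, not R code); it also shows that the doubled backslash in '\\1' is only string escaping for a single backslash followed by 1.

```python
import re

s = "my big fluffy cat"

# The whole string is matched; only the captured last two characters are kept.
last_two = re.sub(r".*(.{2}$)", r"\1", s)

# Labelling both groups makes it visible what each one captured.
labelled = re.sub(r"(.*)(.{2}$)", r"Group 1: \1, Group 2: \2", s)

# An escaped backslash in a normal string literal is the same two characters
# as the raw-string form: one backslash followed by the digit 1.
same_reference = ("\\1" == r"\1")

print(last_two)        # at
print(labelled)        # Group 1: my big fluffy c, Group 2: at
print(same_reference)  # True
```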

TextPad Find Replace Commands Wild Cards

I am trying to figure out how I can put together a find and replace command with wildcards or figure out a way to find and replace the following example:
I would like to find terms that have a double quote in front of them and a single quote at the end:
Example:
find "joe' and replace with 'joe'
Basically, I'm trying to find all terms that have " in front and ' at the end.
Check the [x] Regular expression checkbox in TextPad's replace dialog and enter the following values:
Find what:
"([^'"]*)'
Replace with:
'\1'
Explanation:
In a regular expression, square brackets are used to indicate character classes. A character class beginning with a caret will match anything not in the class.
Thus [^'"] will match any character except ' and ". The following * indicates that any number of these characters can follow. The ( and ) mark a group. And the group we're looking for starts with " and ends with '. Finally in the replace string we can refer to any group via \n where n is the nth group. In our case it is the first and only group and that is why we used \1.
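To experiment with the same find/replace pair outside TextPad, here is a small sketch in Python. The function name fix_quotes is hypothetical, and Python's replacement syntax \1 happens to coincide with the one shown above.

```python
import re

# A double quote, then any run of non-quote characters (captured),
# then a single quote.
MISMATCHED = re.compile("\"([^'\"]*)'")

def fix_quotes(text):
    """Rewrite "term' as 'term' (hypothetical helper for illustration)."""
    return MISMATCHED.sub(r"'\1'", text)

print(fix_quotes("find \"joe' and \"bob'"))  # find 'joe' and 'bob'
print(fix_quotes('say "ok" now'))            # unchanged: say "ok" now
```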

How to write a regex for any text except quotes or multiple hyphens?

Can anybody tell me how to write a regular expression for "no quotes (single or double) allowed and only single hyphens allowed"? For example, "good", 'good', good--looking are not allowed (but good-looking is).
I need to put this regex in the following:
<asp:RegularExpressionValidator ID="revProductName" runat="server"
ErrorMessage="Can not have " or '." Font-Size="Smaller"
ControlToValidate="txtProductName"
ValidationExpression="^[^'|\"]*$"></asp:RegularExpressionValidator>
The one I have handles double and single quotes. Now I need to add multiple hyphens as well. I tried "^[^'|\"|--]*$", but it is not working.
^(?:-(?!-)|[^'"-]++)*$
should do.
^ # Start of string
(?: # Either match...
-(?!-) # a hyphen, unless followed by another hyphen
| # or
[^'"-]++ # one or more characters except quotes/hyphen (possessive match)
)* # any number of times
$ # End of string
So, the regexp has to fail when there is ', or ", or --.
The regexp should therefore check for these at every position and fail if one is found:
^(?:(?!['"]|--).)*$
The idea is to consume the whole line with ., but to check each time before using . that the next character is not ', or ", or the beginning of --.
Also, I like the other answer very much. It uses a slightly different approach. It consumes only characters other than quotes and hyphens ([^'"-]), and if it consumes -, it checks that it's not followed by another -.
Also, there could be one more approach: search for ', or ", or -- in the string, and then fail the regex if they are found. It could be achieved with a regex conditional expression. But this flavor of regex engine doesn't seem to support such conditions.
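The lookahead-per-character pattern can be sanity-checked against the examples from the question. This is a sketch in Python rather than in the validator itself; is_valid_name is a made-up helper name.

```python
import re

# Before consuming each character, assert that it is not a quote
# and not the start of a double hyphen.
VALID = re.compile(r"""^(?:(?!['"]|--).)*$""")

def is_valid_name(value):
    """True if value has no quotes and no run of two hyphens (hypothetical helper)."""
    return VALID.match(value) is not None

for candidate in ["good", "good-looking", "good--looking", '"good"', "'good'"]:
    print(candidate, is_valid_name(candidate))
```

Only the first two candidates pass, which matches the requirement stated in the question.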

ASP.NET regular expression to restrict consecutive characters

Using ASP.NET syntax for the RegularExpressionValidator control, how do you specify restriction of two consecutive characters, say character 'x'?
You can provide a regex like the following:
(\\w)\\1+
(\\w) will match any word character, and \\1+ will match whatever character was matched with (\\w).
I do not have access to asp.net at the moment, but take this console app as an example:
var regex = new Regex("(\\w)\\1+"); // requires: using System.Text.RegularExpressions;
Console.WriteLine(regex.IsMatch("hello") ? "Not valid" : "Valid"); // "hello" contains two consecutive l's, hence not valid
Console.WriteLine(regex.IsMatch("Bar") ? "Not valid" : "Valid"); // "Bar" does not contain any consecutive repeated characters, so it's valid
Alexn is right, this is the way you match consecutive characters with a regex, i.e. (a)\1 matches aa.
However, I think this is a case of everything looking like a nail when you're holding a hammer. I would not use regex to validate this input. Rather, I suggest validating this in code (just looping through the string, comparing str[i] and str[i-1], checking for this condition).
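The loop-based check suggested here is short in any language. A minimal sketch in Python, with has_consecutive_repeat as a hypothetical name:

```python
def has_consecutive_repeat(text):
    """Return True if any character is immediately followed by itself."""
    # zip pairs each character with its successor: (t[0], t[1]), (t[1], t[2]), ...
    return any(current == following for current, following in zip(text, text[1:]))

print(has_consecutive_repeat("hello"))  # True, because of "ll"
print(has_consecutive_repeat("Bar"))    # False
```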
This should work:
^((?<char>\w)(?!\k<char>))*$
It matches abc, but not abbc.
The key is to use so called "zero-width negative lookahead assertion" (syntax: (?! subexpression)).
Here we make sure that a group matched with (?<char>\w) is not followed by itself (expressed with (?!\k<char>)).
Note that \w can be replaced with any valid set of characters (\w does not match whitespace characters).
You can also do it without named group (note that the referenced group has number 2):
^((\w)(?!\2))*$
And it's important to start with ^ and end with $ to match the whole text.
If you want to only exclude text with consecutive x characters, you may use this
^((?<char>x)(?!\k<char>)|[^x\W])*$
or without backreferences
^(x(?!x)|[^x\W])*$
All syntax elements for .NET Framework Regular Expressions are explained here.
You can of course use a regex to detect what's wrong as well as what's right. The regex (.)\1 will match any two consecutive identical characters, so you can just reject any input that it matches. If this is the only validation you need, I think this way is far easier than trying to come up with a regex to validate correct input instead.
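Both directions, accepting only valid input versus detecting invalid input, can be compared side by side. The sketch below uses Python's re module instead of .NET; the numbered-group pattern from the earlier answer and (.)\1 behave the same way there.

```python
import re

# Accepts only strings of word characters with no character repeated in a row.
NO_REPEATS = re.compile(r"^((\w)(?!\2))*$")
# Finds any character immediately followed by itself.
REPEAT = re.compile(r"(.)\1")

for candidate in ["abc", "abbc", "hello", "Bar"]:
    accepted = NO_REPEATS.match(candidate) is not None
    rejected = REPEAT.search(candidate) is not None
    print(candidate, accepted, rejected)
```

For inputs made up only of word characters, the two checks are exact opposites of each other.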

Resources