What is wrong with this Regex "^(.|\s){1,280}$" - asp.net

Should be validating 1-280 input characters, but it hangs when more than 280 characters are input.
Clarification
I am using the above regex to validate the length of input string to be 280 characters maximum.
I am using asp:RegularExpressionValidator to do that.

There's nothing “wrong” with it per se, but it's horrendous because with most RE engines (you don't say which one you're using) when it doesn't match with the first thing it tries because it causes the engine to backtrack and try loads of different possibilities (none of which can ever cause a match). So it's not a hang, but rather just a machine that's trying to execute around 2280 operations to see if there's a match possible. Excuse me if I don't wait around for that!
Of course, it's theoretically possible for the RE compiler to merge the (.|\s) part of the RE into something it doesn't need to backtrack to deal with. Some RE engines do this (typically the more automata-theoretic ones) but many don't (the stack-based ones).

It is trying every possible combination of . and \s for each character trying to find a version of the pattern that matches the string.
. already matches any character, so (.|\s) is redundant. Further, if you just want to check what the length of the string is, then just do that - why are you pulling out regexes?

If you really want to use a regular expression, you could use .{1, 280}$ combined with the SingleLine option, so that the . metacharacter will match everything, including new lines (see here, Regular Expression API section).

Related

Can I extract all matches with functions like aregexec?

I've been enjoying the powerful function aregexec that allows me to mine strings in a fuzzy way.
For that I can search for a string of nucleotide "ATGGCTTCGTC" within a DNA section with defined allowance of insertion, deletion and substitute.
However, it only show me the first match without finishing the whole string. For example,
If I run
aregexec("a","adfasdfasdfaa")
only the first "a" will show up from the result. I'd like to see all the matches.
I wonder if there are other more powerful functions or a argument to be added to this one.
Thank you very much.
P.S. I explained the fuzzy search poorly. I mean, the match doesn't have to be perfect. Say if I allow an substitution of one character, and search AATTGG in ctagtactaAATGGGatctgct, the capital part will be considered a match. I can similarly allow insertions and deletions of certain characters.
gregexpr will show every time there is the pattern in the string, like in this example.
gregexpr("as","adfasdfasdfaa")
There are many more information if you use ?grep in R, it will explain every aspect of using regex.

zsh : Removing run of certain characters at the end of a variable

To keep it simple, consider following task:
Given a non-empty shell variable var, where we know that the last character is the letter a and that at least one character is not an a, remove all the a from the right of the variable.
Example: If the variable initially contains abcadeaaa, it should contain abcade afterwards.
I wondering whether this can be done in a compact way in Zsh.
Of course this is trivial to do using an external program (such as sed), or by using a while loop, where we consecutively strip the last a (${VAR%a}), until the value of the variable doesn't change anymore. Both would work in a POSIX shell, in bash or in ksh. However, given that Zsh has so many nice features for expansion, I wonder, whether there isn't a better way.
The problem is that matching a run of a certain character (independend of length) cries out for regular expressions, but the pattern after % in parameter expansion, and the pattern in the s/// substitution, both is a wildcard pattern, which doesn't allow me to do what I want - at least according to my understanding of the zshexpn man page.
Any ideas for this?
Note that this question is not to solve a real-world problem (in which case I simply would use sed to do the job, as this would get the job done), but more out of academic interest, to find out how far we can stretch the limits of the Zsh expansion mechanism.

ASP.Net Regular Expression Validator no spaces

I'm trying to set up a validation expression for an ASP.Net Regular Expression Validator control. It is for validating the creation of a user name, so I want to limit the number of characters, and I also want to prevent them from using spaces. Here's what I've got so far:
^.*(?=.{5,20})(?=.*\w{5,255}).*$
The \w{5,255} part prevents spaces and special characters (except for underscores, apparently). I have no idea how "5,255" makes it work, but it does; I just copied it from somewhere else.
The main problem I'm having is that if the first or last character is a space (or special character), it passes validation, which is not acceptable. Can anyone help me? I'm sure it is something simple, but I know next to nothing about regular expressions.
You can use something simpler like this:
^[a-zA-Z0-9_]{5,255}$
This will allow alphanumeric usernames between 5-255 characters in length.
(let's expand overall understanding of how to at least use regex!)
The main reason why the posted regex wasn't working is because you were attempting to use lookahead. Lookahead is a 0-length pattern that just guarantees that the next part of the string will match a certain pattern (and is usually used to take advantage of it being 0-length, so it doesn't expand your capturing group).
Effectively, what your regex (going off of the original /^.(?=.{5,20})(?=.\w{5,255}).*$/) meant was:
^. "The beginning of our line should match any single character (provided it's not a newline, although this depends on the regex implementation as well as flags that may or may not have been passed in)"
(?= "and guarantee that after here"
.{5,20}) are any 5-20 characters."
(?= "Also, after that same first character (since, remember, lookahead is 0-length), guarantee"
. "one arbitrary character"
\w{5,255}) "and 5-255 word characters."
.*$ And of course, since all of that exhaustive matching was 0-length, we want the rest of the line to be an arbitrary number of characters."
What you technically could have done to use lookaround was ^(?=\w{5,255}).{5,255}$, but that's just overly convoluted. I'd suggest just using \w{5,255} or something along those lines.

Find HEX patterns and number of occurrences

I'd like to find patterns and sort them by number of occurrences on an HEX file I have.
I am not looking for some specific pattern, just to make some statistics of the occurrences happening there and sort them.
DB0DDAEEDAF7DAF5DB1FDB1DDB20DB1BDAFCDAFBDB1FDB18DB23DB06DB21DB15DB25DB1DDB2EDB36DB43DB59DB32DB28DB2ADB46DB6FDB32DB44DB40DB50DB87DBB0DBA1DBABDBA0DB9ADBA6DBACDBA0DB96DB95DBB7DBCFDBCBDBD6DB9CDBB5DB9DDB9FDBA3DB88DB89DB93DBA5DB9CDBC1DBC1DBC6DBC3DBC9DBB3DBB8DBB6DBC8DBA8DBB6DBA2DB98DBA9DBB9DBDBDBD5DBD9DBC3DB9BDBA2DB84DB83DB7DDB6BDB58DB4EDB42DB16DB0DDB01DB02DAFCDAE9DAE5DAD9DAE2DAB7DA9BDAA6DA9EDAAADAC9DACADAC4DA92DA90DA84DA89DA93DAA9DA8CDA7FDA62DA53DA6EDA
That's an excerpt of the HEX file, and as an example I'd like to get:
XX occurrences of BDBDBD
XX occurrences of B93D
Is there a way to mine the file to generate that output?
Sure. Use a sliding window to create the counts (The link is for Perl, but it seems general enough to understand the algorithm). Your patterns are named N-grams. You will have to limit the maximal pattern, though.
This is a pretty classic CS problem. The code in general is non-trivial to implement as it will require at least one full parse of the sequence, and depending on your efficiency and memory/processor constraints might require several. See here.
You will need to partition your input string in some way to ensure that you get a good subsequence across it.
If there is a specific problem we might be able to help more, but the general strategy is in the Wikipedia article above.
You can use Regular Expressions to make a pattern to search for.
The regex needed would be very simple. Just use the exact phrase you're searching for. Then there should be a regular expression function in the language you're using (you didn't specify) that can count the number of matches.
Use that to create a simple counter.

Use a RegularExpressionValidator to limit a word count?

I want to use an ASP.NET RegularExpressionValidator to limit the number of words in a text box. (The RegularExpressionValidator is my favoured solution because it will do both client and server side checks).
So what would be the correct Regex to put in the RegularExpressionValidator that will count the words and enforce a word-limit? For, lets say, 150 words.
(NB: I see that this question is similar, but the answers given seem to also rely on code such as Split() so I don't think any of them could plug into a RegularExpressionValidator which is why I'm asking again)
Since ^ and $ is implicitly set with RegularExpressionValidators, use the following:
(\S*\s*){0,10}
The 0 here allows empty strings (more specifically 0 words) and 150 is the max number of words to accept. Adjust these as necessary to increase/decrease the number of words accepted.
The above regex is non-greedy, so you'll get a quicker match verses the one given in the question you reference. (\b.*\b){0,10} is greedy, so as you increased the number of words you'll see a decrease in performance.
Here is a quick reference for regular expressions:
http://msdn.microsoft.com/en-us/library/az24scfc.aspx
You can use this site to test the expressions:
http://regexpal.com/
Here is my regex example that works with both minimum and maximum word count (and fixes bug with leading spacing):
^\s*(\S+\s+|\S+$){10,150}$
Check this site:
http://lawrence.ecorp.net/inet/samples/regexp-validate.php#count
It's JavaScript RegEx, but it's very similar to asp.net
It's something like this:
(\b[a-z0-9]+\b.*){4,}

Resources