Regex to add space to string in asp.net - asp.net

String we get in our document:
18.1Commitment fee
(a)The Parent shall pay to the Agent a fee in the Base Currency computed at the rate of:
(i)35 per cent. of the Margin per annum on that Commitment under Facility A for the Availability Period applicable to Facility A;
(ii)40 per cent. of the Margin per annum on that Commitment under Facility B for the Availability Period applicable to Facility B;
None of these lines has a space after the leading marker. The expected output is below:
18.1 Commitment fee
(a) The Parent shall pay......
(i) 35 per cent of the margin....
(ii) 40 per cent of the margin....
How can I handle each case: add a space after a leading number, after a letter marker like (a), and after a Roman numeral like (i)?
The following, provided by Wiktor Stribiżew, works for the number case: Regex.Replace(s, @"^(\d+(?:.\d{1,2})?)(?![\d\s])(.*)", "$1 $2")

Try:
Regex.Replace(s, @"^(\([ivxcdlm]+\)|\([a-z]+\)|\d+\.?\d*)(.*)", "$1 $2", RegexOptions.IgnoreCase)
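If you want to try the idea outside .NET, here is a minimal Python sketch (Python is my choice for illustration; the question itself is about C#). It makes two changes that are my own assumptions, not part of the answer: the two bracket alternatives are folded into one, since [a-z]+ already covers Roman numerals, and a lookahead is used so that lines which already have the space are left alone. The name add_space_after_marker is hypothetical.

```python
import re

# Leading marker: "(letters)" such as (a) or (ii), or a number such as 18 or 18.1.
# (?![\d.]) keeps backtracking from splitting a number in the middle.
# (?=\S) only lets the pattern match when a non-space character follows the marker.
MARKER = re.compile(
    r"^(\([a-z]+\)|\d+(?:\.\d+)*(?![\d.]))(?=\S)",
    re.IGNORECASE | re.MULTILINE,
)


def add_space_after_marker(text: str) -> str:
    """Insert one space after a leading clause marker on every line of text."""
    return MARKER.sub(r"\1 ", text)


sample = "18.1Commitment fee\n(a)The Parent shall pay\n(i)35 per cent.\n(ii)40 per cent."
print(add_space_after_marker(sample))
```

Because of the lookahead, running the function on its own output changes nothing, so it should be safe to apply to documents where some lines are already spaced correctly.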
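Regex details for the .NET pattern above:
"^" Assert position at the beginning of a line (at the beginning of the string or after a line feed). Note that .NET only treats ^ this way when RegexOptions.Multiline is set; without it, ^ matches only at the very start of the string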
"(" Match the regex below and capture its match into backreference number 1
Match this alternative (attempting the next alternative only if this one fails)
"\(" Match the character “(” literally
"[ivxcdlm]" Match a single character from the list “ivxcdlm” (case insensitive)
"+" Between one and unlimited times, as many times as possible, giving back as needed (greedy)
"\)" Match the character “)” literally
"|"
Or match this alternative (attempting the next alternative only if this one fails)
"\(" Match the character “(” literally
"[a-z]" Match a single character in the range between “a” and “z” (case insensitive)
"+" Between one and unlimited times, as many times as possible, giving back as needed (greedy)
"\)" Match the character “)” literally
"|"
Or match this alternative (the entire group fails if this one fails to match)
"\d" Match a single character that is a “digit” (any decimal number in any Unicode script)
"+" Between one and unlimited times, as many times as possible, giving back as needed (greedy)
"\." Match the character “.” literally
"?" Between zero and one times, as many times as possible, giving back as needed (greedy)
"\d" Match a single character that is a “digit” (any decimal number in any Unicode script)
"*" Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
")"
"(" Match the regex below and capture its match into backreference number 2
"." Match any single character that is NOT a line break character (line feed)
"*" Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
")"

Related

How to match duplicate letters in interjections [duplicate]

I plan to remove repeated elements (each containing two or more characters) from strings. For example, from "aaa" I expect "aaa", from "aaaa" I expect "aa", from "abababcdcd" I expect "abcd", from "cdababcdcd" I expect "cdabcd".
I tried gsub("(.{2,})\\1+","\\1",str). It works in cases 1-3, but fails in case 4. How to solve this problem?
SOLUTION
The solution is to rely on the PCRE or ICU regex engines, rather than TRE.
Use either base R gsub with perl=TRUE (it uses PCRE regex engine) and "(?s)(.{2,})\\1+" pattern, or a stringr::str_replace_all() (it uses ICU regex engine) with the same pattern:
> x <- "cdababcdcd"
> gsub("(?s)(.{2,})\\1+", "\\1", x, perl=TRUE)
[1] "cdabcd"
> library(stringr)
> str_replace_all(x, "(?s)(.{2,})\\1+", "\\1")
[1] "cdabcd"
The (?s) flag is necessary for . to match any char including line break chars (in TRE regex, . matches all chars by default).
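As a cross-check outside R, Python's re module is also a backtracking engine, and the same pattern gives the expected result for all four cases from the question. This is only a sketch for comparison; collapse_repeats is a made-up name.

```python
import re

# (.{2,}) captures a run of two or more characters; \1+ requires that same run
# to repeat at least once more right after it. re.DOTALL plays the role of (?s).
REPEATED_RUN = re.compile(r"(.{2,})\1+", re.DOTALL)


def collapse_repeats(text: str) -> str:
    """Replace each immediately repeated run with a single copy of the run."""
    return REPEATED_RUN.sub(r"\1", text)


for case in ["aaa", "aaaa", "abababcdcd", "cdababcdcd"]:
    print(case, "->", collapse_repeats(case))
```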
DETAILS
TRE regex is not good at handling "pathological" cases that are mostly related to backtracking, which directly involves quantifiers:
The matching algorithm used in TRE uses linear worst-case time in the length of the text being searched, and quadratic worst-case time in the length of the used regular expression. In other words, the time complexity of the algorithm is O(M²N), where M is the length of the regular expression and N is the length of the text. The used space is also quadratic on the length of the regex, but does not depend on the searched string. This quadratic behaviour occurs only on pathological cases which are probably very rare in practice.
Predictable matching speed
Because of the matching algorithm used in TRE, the maximum time consumed by any regexec() call is always directly proportional to the length of the searched string. There is one exception: if back references are used, the matching may take time that grows exponentially with the length of the string. This is because matching back references is an NP complete problem, and almost certainly requires exponential time to match in the worst case.
In those cases, when TRE has trouble calculating all the possible ways of matching a string, it does not return any match and the string is returned as is. Hence, the gsub call changes nothing.
As easy as:
gsub("(.{2,})\\1+","\\1",str, perl = T)

What is the regex validation for positive and negative numbers?

I currently have a regex validator to restrict the user to only input numbers greater than 1. How can I allow both positive and negative numbers?
^[1-9]+([0-9]+)*$
Adding -? will do the trick:
^-?[1-9]+([0-9]+)*$
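A quick way to check the optional sign is a small Python sketch like the one below. Note one assumption: I wrote the body as [1-9][0-9]*, which accepts the same strings as [1-9]+([0-9]+)* but without the nested quantifiers. The function name is hypothetical.

```python
import re

# Optional minus sign, one non-zero digit, then any number of digits.
SIGNED_INTEGER = re.compile(r"-?[1-9][0-9]*")


def is_signed_integer(text: str) -> bool:
    """Return True for integers like 42 or -42 with no leading zero."""
    return SIGNED_INTEGER.fullmatch(text) is not None


for candidate in ["42", "-42", "0", "007", "4.2", "--4"]:
    print(candidate, is_signed_integer(candidate))
```

As with the original validator, 0 and numbers with leading zeros are rejected.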
Assuming a negative number will be simply marked by a preceding - sign the following expression should work:
/(^|( )|\t)(-|)\d{1,}/gm
Explanation:
First, (^|( )|\t) matches the start of a line OR (because of the |) a space OR a tab. If you have requirements on the white space surrounding the input, you can tweak this regular expression in this section.
Next, (-|) matches either the - character OR (because of the |) nothing
Then it matches a digit \d, where there is at least 1 digit, but possibly an infinite number (because of {1,})
Next, the g sets the global flag allowing more than one instance to be matched.
Finally, the m sets the multi-line flag, which makes ^ match at the start of every line instead of only at the start of the whole input. Without it, numbers at the beginning of the second and later lines would not be found.
This was tested with the following cases:
-0934 sdj2a
1328 232
-93 2939 -192
Where the matched numbers were:
-0934, 1328, 232, -93, 2939, -192
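The same test can be reproduced with a short Python sketch. It is an adaptation rather than the exact expression: the leading alternation is made non-capturing so that only the sign and digits are returned, and re.MULTILINE stands in for the m flag. find_signed_numbers is a hypothetical name.

```python
import re

# (?:^|[ \t]) requires the number to start a line or to follow a space or tab.
# The single capture group holds only the optional sign and the digits.
SIGNED_NUMBER = re.compile(r"(?:^|[ \t])(-?\d+)", re.MULTILINE)


def find_signed_numbers(text: str) -> list[str]:
    """Return every whitespace-delimited number (with optional sign) in text."""
    return SIGNED_NUMBER.findall(text)


cases = "-0934 sdj2a\n1328 232\n-93 2939 -192"
print(find_signed_numbers(cases))
```

The 2 inside sdj2a is skipped because it is preceded by a letter rather than by whitespace or a line start.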

Are "unit-relevant" CSS property values with prepended zeroes equivalent to the corresponding "no-zeroes-prepended" values?

I was scanning some stylesheets when I noticed one which used a linear-gradient with rgba() color-stops in which the rgba numbers used multiple instances of 0 instead of just a single 0:
background-image:linear-gradient(to top left, rgba(000,000,000,0.1),rgba(100,100,100,1));
I hadn't seen multiple zeroes (instead of a single zero) occupying a single slot in the rgb/a color space before, but confirmed on CodePen this is valid. I then looked up the W3C definition of number here.
To make a long story short, after some more poking and digging, I found (to my surprise) that I could prepend any number of zeroes to a length and get the same result as with no zeroes prepended, like this:
/* The two squares generated have equivalent width and height of 100px - for giggles, I also extended the same idea to the transition-duration time */
<style>
div.aaa {
width:00000000100px;
height:100px;
background-image:linear-gradient(to top left,rgba(000,000,000,0.1),rgba(100,100,100,1));
transition:1s cubic-bezier(1,1,1,1)
}
div.bbb {
width:100px;
height:000000000000000000000000000000000100px;
background-color:green;
transition:0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001s cubic-bezier(1,1,1,1)
}
div:hover { background-color:red }
</style>
<div class="aaa"></div>
<div class="bbb"></div>
It's difficult to directly verify these numbers are equivalent representations, because using a scripting language:
/* PHP */
$x = 100;
$y = 00000000000100; // problem is PHP treats this as an octal number
echo ($x == $y) ? 'true' : 'false'; // echoes the string ---> false
/* Javascript */
var x = 100;
var y = 00000000000100; // also treats this as an octal number
var res = (x == y) ? 'true' : 'false';
alert(res); // alerts ---> false
These examples suggest to me that CSS does not treat e.g. 0000100 as an octal number, but rather as a decimal (or at least as non-octal numbers) since the magnitude of the width, height, and transition-duration for the html elements generated above appear to be identical.
Extending this approach to any CSS property and any unit (e.g., time):
Is any unit-containing CSS property value prepended with any positive number of zeroes syntactically equivalent to the same value without any prepended zeroes?
I have to admit I found this question interesting.
https://www.w3.org/TR/CSS21/syndata.html
The css 2 syntax spec says:
num [0-9]+|[0-9]*\.[0-9]+
Note that 000000000000000037.3 meets this rule and definition, a series of numbers between 0 and 9, optionally followed by a . and a further series of numbers from 0 to 9.
The css 3 spec goes on:
https://www.w3.org/TR/css3-values/#numbers
4.2. Real Numbers: the <number> type
Number values are denoted by <number>, and represent real numbers,
possibly with a fractional component.
When written literally, a number is either an integer, or zero or more
decimal digits followed by a dot (.) followed by one or more decimal
digits and optionally an exponent composed of "e" or "E" and an
integer. It corresponds to the <number-token> production in the CSS
Syntax Module [CSS3SYN]. As with integers, the first character of a
number may be immediately preceded by - or + to indicate the number’s
sign.
https://www.w3.org/TR/css-syntax-3/#convert-a-string-to-a-number
This I believe roughly explains how a css parser is supposed to take the css value and convert it to a number:
4.3.13. Convert a string to a number
This section describes how to convert a string to a number. It
returns a number.
Note: This algorithm does not do any verification to ensure that the
string contains only a number. Ensure that the string contains only a
valid CSS number before calling this algorithm.
Divide the string into seven components, in order from left to right:
1. A sign: a single U+002B PLUS SIGN (+) or U+002D HYPHEN-MINUS (-), or the empty string. Let s be the number -1 if the sign is U+002D HYPHEN-MINUS (-); otherwise, let s be the number 1.
2. An integer part: zero or more digits. If there is at least one digit, let i be the number formed by interpreting the digits as a base-10 integer; otherwise, let i be the number 0.
3. A decimal point: a single U+002E FULL STOP (.), or the empty string.
4. A fractional part: zero or more digits. If there is at least one digit, let f be the number formed by interpreting the digits as a base-10 integer and d be the number of digits; otherwise, let f and d be the number 0.
5. An exponent indicator: a single U+0045 LATIN CAPITAL LETTER E (E) or U+0065 LATIN SMALL LETTER E (e), or the empty string.
6. An exponent sign: a single U+002B PLUS SIGN (+) or U+002D HYPHEN-MINUS (-), or the empty string. Let t be the number -1 if the sign is U+002D HYPHEN-MINUS (-); otherwise, let t be the number 1.
7. An exponent: zero or more digits. If there is at least one digit, let e be the number formed by interpreting the digits as a base-10 integer; otherwise, let e be the number 0.
Return the number s·(i + f·10^(-d))·10^(t·e).
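To convince myself that leading zeros drop out, the quoted algorithm can be sketched in a few lines of Python. This is my own illustration of the seven components, not code from any browser; the regex split and the name css_string_to_number are assumptions, and like the spec's algorithm it does not validate that the input is a well-formed CSS number.

```python
import re

# One group per component from the spec's list:
# sign, integer part, decimal point, fractional part,
# exponent indicator, exponent sign, exponent.
COMPONENTS = re.compile(r"([+-]?)(\d*)(\.?)(\d*)([eE]?)([+-]?)(\d*)", re.ASCII)


def css_string_to_number(text: str) -> float:
    """Convert an already validated CSS number string to a float."""
    match = COMPONENTS.fullmatch(text)
    if match is None:
        raise ValueError(f"not a CSS number: {text!r}")
    sign, integer, _point, fraction, _indicator, exp_sign, exponent = match.groups()

    s = -1 if sign == "-" else 1
    i = int(integer) if integer else 0   # base 10, so leading zeros add nothing
    f = int(fraction) if fraction else 0
    d = len(fraction)
    t = -1 if exp_sign == "-" else 1
    e = int(exponent) if exponent else 0

    # s * (i + f * 10^(-d)) * 10^(t * e)
    return float(s * (i + f * 10 ** -d) * 10 ** (t * e))


print(css_string_to_number("00000000100"))
print(css_string_to_number("100"))
```

Both calls give the same value, because the integer part is read as a plain base-10 integer and there is no octal interpretation anywhere in the steps.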
I think the key term there is a base-10 number.
Note that for other possible situations where the starting 0 is meaningful, you have to escape it for it to function as something other than a simple number, I believe, if I read this spec right:
https://www.w3.org/TR/css-syntax-3/#escaping
Any Unicode code point can be included in an identifier or quoted
string by escaping it. CSS escape sequences start with a backslash
(\), and continue with:
Any Unicode code point that is not a hex digits or a newline. The escape sequence is replaced by that code point.
Or one to six hex digits, followed by an optional whitespace. The escape sequence is replaced by the Unicode code point whose value is
given by the hexadecimal digits. This optional whitespace allow
hexadecimal escape sequences to be followed by "real" hex digits.
An identifier with the value "&B" could be written as \26 B or \000026B.
A "real" space after the escape sequence must be doubled.
However, even here it appears the starting 0's are optional, though it's not crystal clear.
The CSS specs, while obtuse, were fairly clear, which isn't always the case. So yes, numbers are made from strings of digits, can have a fractional part as well, and are base 10, which means the leading zeros simply contribute nothing.
I speculate as well that because the specs further state that no units are required when the number value is 0, that in fact, a leading zero may mean null, nothing, internally, though obviously you'd have to look at css parsing code itself to see how that is actually handled by browsers.
So that's kind of interesting. I think that probably because css is a very simple language, it doesn't do 'clever' things like php or javascript do with leading zeros, it simply does what you'd expect, treat them as zeros, nothing.
Thanks for asking though, sometimes it's nice to go back and read the raw specs just to see how the stuff works.

Regexpression asp.net validator for a few words

I'm trying to create a validator for a string that may contain 1-N words, which are separated by a single whitespace character (spaces only between words). I'm a newbie at regex, so I feel a bit confused, because my expression seems to be correct:
^[[a-zA-Z]+\s{1}]{0,}[a-zA-Z]+$
What am I doing wrong here? (it accepts only 2 words .. but I want it to accept 1+ words)
Any help is greatly appreciated :)
As often happens with someone beginning a new programming language or syntax, you're close, but not quite! The ^ and $ anchors are being used correctly, and the character classes [a-zA-Z] will match only letters (sounds right to me), but your repetition is a little off, and your grouping is not what you think it is - which is your primary problem.
^[[a-zA-Z]+\s{1}]{0,}[a-zA-Z]+$
 ^           ^^^^^^^^
 a           bbbacccc
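It only matches two words because you effectively don't have any group repetition; this is because you don't really have any groups, only character classes. As written, [[a-zA-Z]+ is a character class (letters plus a literal [) repeated one or more times, \s{1} is one whitespace character, ]{0,} is zero or more literal ] characters, and [a-zA-Z]+ is the second word. The simplest fix is to change the first [ and its matching closing bracket (marked by a's in the listing above) to parentheses: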
^([a-zA-Z]+\s{1}){0,}[a-zA-Z]+$
This single change will make it work the way you expect! However, there are a few recommendations and considerations I'd like to make.
First, for readability and code maintenance, use the single character repetition operators instead of repetition braces wherever possible. * repeats zero or more times, + repeats one or more times, and ? repeats 0 or one times (AKA optional). Your repetition curly braces are syntactically correct, and do what you intend them to, but one (marked by b's above) should be removed because it is redundant, and the other (marked by c's above) should be shortened to an asterisk *, as they have exactly the same meaning:
^([a-zA-Z]+\s)*[a-zA-Z]+$
Second, I would recommend considering (depending upon your application requirements) the \w shorthand character class instead of the [a-zA-Z] character class, with the following considerations:
it matches both upper and lowercase letters
it does match more than letters (it matches digits 0-9 and the underscore as well)
it can often be configured to match non-English (unicode) letters for multi-lingual input
If any of these are unnecessary or undesirable, then you're on the right track!
On a side note, the character combination \b is a word-boundary assertion and is not needed for your case, as you will already begin and end where there are letters and letters only!
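To see the difference between the bracketed and the parenthesized versions side by side, here is a small Python sketch (Python is used only for illustration; the validator in the question is ASP.NET). One assumption to flag: the inner [ of the original pattern is written as \[ so that Python does not warn about a nested set; the meaning is unchanged.

```python
import re

# Original attempt: the outer [ ... ] is a character class, not a group,
# so the pattern reads "letters, one whitespace, optional ] characters, letters".
BRACKETS = re.compile(r"^[\[a-zA-Z]+\s{1}]{0,}[a-zA-Z]+$")

# Corrected pattern: a real group of "word plus whitespace", repeated zero or more times.
GROUPED = re.compile(r"^([a-zA-Z]+\s)*[a-zA-Z]+$")

for candidate in ["hello", "hello world", "one two three", "two  spaces"]:
    print(
        repr(candidate),
        "brackets:", bool(BRACKETS.search(candidate)),
        "grouped:", bool(GROUPED.search(candidate)),
    )
```

The bracketed version accepts only the two-word input, while the grouped version accepts one, two, or three words and still rejects the doubled space.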
As for learning more about regular expressions, I would recommend Regular-Expressions.info, which has a wealth of info about regexes and the inner workings and quirks of the various implementations. I also use a tool called RegexBuddy to test and debug expressions.

Regular expression to match 10-14 digits

I am using regular expressions for matching only digits, minimum 10 digits, maximum 14. I tried:
^[0-9]
I'd give:
^\d{10,14}$
a shot.
I also like to offer extra solutions for RE engines that don't support all that PCRE stuff so, in a pinch, you could use:
^[0-9]{10,14}$
If your RE engine is so primitive that it doesn't even allow specific repetitions, you'd have to revert to either some ugly hack like fully specifying the number of digits with alternate REs for 10 to 14 or, easier, just checking for:
^[0-9]*$
and ensuring the length was between 10 and 14.
But that won't be needed for this case (ASP.NET).
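For comparison, both approaches (the bounded repetition and the "digits only plus length check" fallback) can be sketched in Python as follows. The function names are made up for the example, and [0-9] is used instead of \d so that only ASCII digits are accepted.

```python
import re

DIGITS_10_TO_14 = re.compile(r"[0-9]{10,14}")
DIGITS_ONLY = re.compile(r"[0-9]*")


def is_valid_by_regex(text: str) -> bool:
    """Bounded repetition: the whole input must be 10 to 14 digits."""
    return DIGITS_10_TO_14.fullmatch(text) is not None


def is_valid_by_length(text: str) -> bool:
    """Fallback: check 'digits only' and the length as two separate steps."""
    return DIGITS_ONLY.fullmatch(text) is not None and 10 <= len(text) <= 14


for candidate in ["123456789", "1234567890", "12345678901234", "123456789012345"]:
    print(candidate, is_valid_by_regex(candidate), is_valid_by_length(candidate))
```

The two functions should agree on every input; the second one is only worth using when the engine lacks the {min,max} syntax.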
^\d{10,14}$
regular-expressions.info
Character Classes or Character Sets
\d is short for [0-9]
Limiting Repetition
The syntax is {min,max}, where min is a positive integer number indicating the minimum number of matches, and max is an integer equal to or greater than min indicating the maximum number of matches.
The limited repetition syntax also allows these:
^\d{10,}$ // match at least 10 digits
^\d{13}$ // match exactly 13 digits
Try this:
@"^\d{10,14}$"
\d - matches a character that is a digit
This will help you
If I understand your question correctly, this should work:
\d{10,14}
Note:
As noted in the other answer, this pattern on its own also matches a run of 10 to 14 digits inside a longer input; use ^\d{10,14}$ to match the entire input.

Resources