I have a problem with a regular expression. Here is an example text:
"Status: matched: 10:36:08 09/03/2013 from=0.0.0.0:162 oid=1.3.6.1.4.1.11536.3.6.1000 trap= n/a specific= n/a traptime=60 days, 17:39:10.0 community=Cyber-Ark agent=192.118.37.30 version= v2c var1=italog var2= var3=03/09/2013 10:35:37 ITATS426E Safe oniya_gemel is out of space.__"
Which regular expression should I use to match everything from “var3=” up to and including “out of space”? I need the whole sentence as the match: “var3=03/09/2013 10:35:37 ITATS426E Safe oniya_gemel is out of space.__”
I have a regular expression tool and have used
/(var3=)*(out of space)/
so far, but it matches only “out of space”.
Any input would be much appreciated!
Thank you in advance!!!!
Vesec
You need to specify that there are characters between var3= and out of space.
/(var3=).*(out of space)/
The star means "the previous element repeated 0 or more times". So your regexp looked for 0 or more repetitions of "var3=" immediately followed by "out of space". I.e. the star applied to the "(var3=)" group, rather than indicating that there should be characters between it and "out of space".
"." in regexp matches any character, so my proposal states "var3=", followed by 0 or more characters, followed by "out of space".
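A quick way to see the difference is to run both patterns against the sample line. The sketch below uses Python's re module as a stand-in, since the question does not name the tool in use; the variable names are illustrative, and the sample line is shortened at the front.

```python
import re

# Sample line from the question (front part trimmed, tail unchanged).
text = ("agent=192.118.37.30 version= v2c var1=italog var2= "
        "var3=03/09/2013 10:35:37 ITATS426E Safe oniya_gemel is out of space.__")

# Original attempt: the star applies to the (var3=) group only.
original = re.search(r"(var3=)*(out of space)", text)

# Corrected: ".*" allows arbitrary characters between the two parts.
fixed = re.search(r"(var3=).*(out of space)", text)

print(original.group(0))  # only the "out of space" part
print(fixed.group(0))     # from "var3=" through "out of space"
```

The corrected pattern stops at "out of space". To also take in the trailing ".__", one option is to append .* to the pattern so that it runs to the end of the line.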
I'm trying to understand a regular expression someone has written in the gsub() function.
I've never used regular expressions before seeing this code, and I have tried to work out how it gets the final result with some googling, but I have hit a wall, so to speak.
gsub('.*(.{2}$)', '\\1',"my big fluffy cat")
This code returns the last two characters in the given string. In the above example it would return "at". This is the expected result, but from my brief foray into regular expressions I don't understand why this code does what it does.
What I understand is that '.*' means look for any character 0 or more times. So it's going to look at the entire string, and this is what will be replaced.
The part in brackets looks for any two characters at the end of the string. It would make more sense to me if this part in brackets were in place of the '\1'. To me it would then read: look at the entire string and replace it with the last two characters of that string.
All that does, though, is output the actual code as the replacement, e.g. ".{2}$".
Finally, I don't understand why '\1' is in the replacement part of the function. To me this is just saying: replace the entire string with a single backslash and the number one. I say a single backslash because it's my understanding that the first backslash is just there to make the second backslash a non-special character.
There are two ways of using gsub. The most common way is probably:
gsub("-","TEST","This is a - ")
which would return
This is a TEST
This simply finds the matches of the regular expression and replaces each one with the replacement string.
The second way to use gsub is the one you described, using \\1, \\2 or \\3...
These refer to the first, second or third capture group in your regular expression.
A capture group is defined by anything inside round brackets, e.g. (capture_group_1)(capture_group_2)...
Explanation
Your analysis is correct.
What I understand is that '.*' means look for any character 0 or more times. So it's going to look at the entire string, and this is what will be replaced.
The part in brackets looks for any two characters at the end of the string
The last two characters are placed in a capture group, and we simply replace the whole string with this capture group. The two characters themselves are not replaced with anything.
If it helps, check out the result of this expression.
gsub('(.*)(.{2}$)', 'Group 1: \\1, Group 2: \\2',"my big fluffy cat")
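The same behaviour can be reproduced outside R. The sketch below uses Python's re.sub as an analogue of gsub (an assumption for illustration; it is not the asker's code), where a raw string lets the backreference be written as \1 rather than \\1.

```python
import re

s = "my big fluffy cat"

# Analogue of gsub('.*(.{2}$)', '\\1', s): the whole match is replaced
# by whatever group 1 captured.
last_two = re.sub(r".*(.{2}$)", r"\1", s)

# Analogue of the two-group variant: both captures are reused in the output.
labelled = re.sub(r"(.*)(.{2}$)", r"Group 1: \1, Group 2: \2", s)

print(last_two)
print(labelled)
```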
Hopefully these examples help you understand it better:
Say we have the string foobarabcabcdef
.* matches the whole string.
.*abc matches any characters from the beginning up to the last abc (greedy matching); thus, it matches foobarabcabc
.*(...)$ matches the whole string as well; however, the last 3 chars are grouped. Even without the (), the matched string has a default group, group 0; each () then becomes group 1, 2, 3... Think about .*(...)(...)(...)$, for which we have:
group 0 : whole string
group 1 : "abc" the first "abc"
group 2 : "abc" the 2nd "abc"
group 3 : "def" the last 3 chars
So back to your example: the \\1 is a reference to a group. What it does is: "replace the whole string with the text matched by group 1". That is, the text matched by the .{2}$ part is the replacement.
If you don't understand the backslashes, you will have to consult the R syntax reference; I cannot say more. It is all about escaping.
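The group numbering described for foobarabcabcdef can be checked directly. This is a small sketch in Python (chosen for illustration; the thread itself is about R), where group(0) is the whole match and groups() lists the numbered captures.

```python
import re

s = "foobarabcabcdef"

# Greedy ".*" runs as far as possible, so the match ends at the last "abc".
greedy = re.match(r".*abc", s)

# Three 3-character groups pinned to the end of the string.
m = re.match(r".*(...)(...)(...)$", s)

print(greedy.group(0))  # text consumed up to the last "abc"
print(m.group(0))       # group 0: the whole match
print(m.groups())       # groups 1, 2 and 3
```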
The important part of that regular expression is the brackets; this is called a "capturing group".
The regular expression .*(.{2}$) says: match anything and capture the last 2 characters of the line. The replacement \\1 refers to that group, so it will replace the whole match with the captured group, which is the last two characters in this case.
I'm trying to extract UK postcodes from address strings in R, using the regular expression provided by the UK government here.
Here is my function:
address_to_postcode <- function(addresses) {
# 1. Convert addresses to upper case
addresses = toupper(addresses)
# 2. Regular expression for UK postcodes:
pcd_regex = "[Gg][Ii][Rr] 0[Aa]{2})|((([A-Za-z][0-9]{1,2})|(([A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})|(([A-Za-z][0-9][A-Za-z])|([A-Za-z][A-Ha-hJ-Yj-y][0-9]?[A-Za-z])))) {0,1}[0-9][A-Za-z]{2})"
# 3. Check if a postcode is present in each address or not (return TRUE if present, else FALSE)
present <- grepl(pcd_regex, addresses)
# 4. Extract postcodes matching the regular expression for a valid UK postcode
postcodes <- regmatches(addresses, regexpr(pcd_regex, addresses))
# 5. Return NA where an address does not contain a (valid format) UK postcode
postcodes_out <- list()
postcodes_out[present] <- postcodes
postcodes_out[!present] <- NA
# 6. Return the results in a vector (should be same length as input vector)
return(do.call(c, postcodes_out))
}
According to the guidance document, the logic this regular expression looks for is as follows:
"GIR 0AA" OR One letter followed by either one or two numbers OR One letter followed by a second letter that must be one of
ABCDEFGHJKLMNOPQRSTUVWXY (i.e..not I) and then followed by either one or two numbers OR
One letter followed by one number and then another letter OR
A two part post code where the first part must be One letter followed by a second letter that must be one of ABCDEFGHJKLMNOPQRSTUVWXY (i.e..not I) and then followed by one number and optionally a further letter after that AND
The second part (separated by a space from the first part) must be One number followed by two letters.
A combination of upper and lower case characters is allowed.
Note: the length is determined by the regular expression and is between 2 and 8 characters.
My problem is that this logic is not completely preserved when using the regular expression without the ^ and $ anchors (as I have to do in this scenario because the postcode could be anywhere within the address strings); what I'm struggling with is how to preserve the order and number of characters for each segment in a partial (as opposed to complete) string match.
Consider the following example:
> address_to_postcode("1A noplace road, random city, NR1 2PK, UK")
[1] "NR1 2PK"
According to the logic in the guideline, the second letter in the postcode cannot be 'z' (and there are some other exclusions too); however look what happens when I add a 'z':
> address_to_postcode("1A noplace road, random city, NZ1 2PK, UK")
[1] "Z1 2PK"
... whereas in this case I would expect the output to be NA.
Adding the anchors (for a different usage case) doesn't seem to help as the 'z' is still accepted even though it is in the wrong place:
> grepl("^[Gg][Ii][Rr] 0[Aa]{2})|((([A-Za-z][0-9]{1,2})|(([A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})|(([A-Za-z][0-9][A-Za-z])|([A-Za-z][A-Ha-hJ-Yj-y][0-9]?[A-Za-z])))) {0,1}[0-9][A-Za-z]{2})$", "NZ1 2PK")
[1] TRUE
Two questions:
1. Have I misunderstood the logic of the regular expression, and
2. If not, how can I correct it (i.e. why aren't the specified letter and character ranges exclusive to their position within the regular expression)?
Edit
Since posting this answer, I dug deeper into the UK government's regex and found even more problems. I posted another answer here that describes all the issues and provides alternatives to their poorly formatted regex.
Note
Please note that I'm posting the raw regex here. You'll need to escape certain characters (like backslashes \) when porting to R.
Issues
You have many issues here, all of which are caused by whoever created the document you're retrieving your regex from or the coder that created it.
1. The space character
My guess is that when you copied the regular expression from the link you provided it converted the space character into a newline character and you removed it (that's exactly what I did at first). You need to, instead, change it to a space character.
^([Gg][Ii][Rr] 0[Aa]{2})|((([A-Za-z][0-9]{1,2})|(([A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})|(([AZa-z][0-9][A-Za-z])|([A-Za-z][A-Ha-hJ-Yj-y][0-9]?[A-Za-z])))) [0-9][A-Za-z]{2})$
(The space in question is the one between )))) and the final [0-9][A-Za-z]{2}.)
2. Boundaries
You need to remove the anchors ^ and $ as these indicate start and end of line. Instead, wrap your regex in (?:) and place a \b (word boundary) on either end as the following shows. In fact, the regex in the documentation is incorrect (see Side note for more information) as it will fail to anchor the pattern properly.
See regex in use here
\b(?:([Gg][Ii][Rr] 0[Aa]{2})|((([A-Za-z][0-9]{1,2})|(([A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})|(([AZa-z][0-9][A-Za-z])|([A-Za-z][A-Ha-hJ-Yj-y][0-9]?[A-Za-z])))) [0-9][A-Za-z]{2}))\b
(The additions are \b(?: at the start and )\b at the end.)
3. Character class oversight
There's a missing - in the character class, as pointed out by @deadcrab in his answer here.
\b(?:([Gg][Ii][Rr] 0[Aa]{2})|((([A-Za-z][0-9]{1,2})|(([A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})|(([A-Za-z][0-9][A-Za-z])|([A-Za-z][A-Ha-hJ-Yj-y][0-9]?[A-Za-z])))) [0-9][A-Za-z]{2}))\b
(The fix is in the third alternative, where [AZa-z] becomes [A-Za-z].)
4. They made the wrong character class optional!
In the documentation it clearly states:
A two part post code where the first part must be:
One letter followed by a second letter that must be one of ABCDEFGHJKLMNOPQRSTUVWXY (i.e..not I) and then followed by one number and optionally a further letter after that
They made the wrong character class optional!
\b(?:([Gg][Ii][Rr] 0[Aa]{2})|((([A-Za-z][0-9]{1,2})|(([A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})|(([A-Za-z][0-9][A-Za-z])|([A-Za-z][A-Ha-hJ-Yj-y][0-9]?[A-Za-z])))) [0-9][A-Za-z]{2}))\b
(In the last alternative, [0-9]? is the class that was made optional; it should be the trailing [A-Za-z] that is optional instead.)
5. The whole thing is just awful...
There are so many things wrong with this regex that I just decided to rewrite it. It can very easily be simplified to perform a fraction of the steps it currently takes to match text.
\b(?:[A-Za-z][A-HJ-Ya-hj-y]?[0-9][0-9A-Za-z]? [0-9][A-Za-z]{2}|[Gg][Ii][Rr] 0[Aa]{2})\b
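As a sanity check, the rewritten pattern can be run against the two addresses from the question. The sketch below uses Python's re module rather than R (an illustrative choice); the constant name POSTCODE is hypothetical, and the pattern string is the one above, unchanged.

```python
import re

# Rewritten pattern from above, split over two lines for readability only.
POSTCODE = re.compile(
    r"\b(?:[A-Za-z][A-HJ-Ya-hj-y]?[0-9][0-9A-Za-z]? [0-9][A-Za-z]{2}"
    r"|[Gg][Ii][Rr] 0[Aa]{2})\b"
)

good = POSTCODE.search("1A noplace road, random city, NR1 2PK, UK")
bad = POSTCODE.search("1A noplace road, random city, NZ1 2PK, UK")

print(good.group(0) if good else None)
print(bad.group(0) if bad else None)
```

The word boundary is what stops the engine from starting the match at the Z of NZ1, which is how the unanchored original ended up returning "Z1 2PK".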
Answer
As mentioned in the comments below my answer, some postcodes are missing the space character. For missing spaces in the postcodes (e.g. NR12PK), simply add a ? after the spaces as shown in the regex below:
\b(?:[A-Za-z][A-HJ-Ya-hj-y]?[0-9][0-9A-Za-z]? ?[0-9][A-Za-z]{2}|[Gg][Ii][Rr] ?0[Aa]{2})\b
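(The two additions are the ? after each space: one before [0-9][A-Za-z]{2} and one inside the GIR alternative.)
You may also shorten the regex above with the following and use the case-insensitive flag (ignore.case(pattern) or ignore_case = TRUE in R, depending on the method used):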
\b(?:[A-Z][A-HJ-Y]?[0-9][0-9A-Z]? ?[0-9][A-Z]{2}|GIR ?0A{2})\b
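For comparison with the R function in the question, here is a rough Python analogue built on the short case-insensitive pattern. It is a sketch under assumptions: the function reuses the name address_to_postcode for readability but is not the asker's code, it returns None where the R version returns NA, and it upper-cases the result as the R version does with toupper.

```python
import re

# Short form of the pattern; re.IGNORECASE stands in for the R flag.
POSTCODE_CI = re.compile(
    r"\b(?:[A-Z][A-HJ-Y]?[0-9][0-9A-Z]? ?[0-9][A-Z]{2}|GIR ?0A{2})\b",
    re.IGNORECASE,
)

def address_to_postcode(addresses):
    """Return the first postcode-shaped substring per address, or None."""
    out = []
    for address in addresses:
        m = POSTCODE_CI.search(address)
        out.append(m.group(0).upper() if m else None)
    return out

result = address_to_postcode([
    "1A noplace road, random city, NR1 2PK, UK",
    "1A noplace road, random city, nr12pk, UK",   # no space, lower case
    "1A noplace road, random city, NZ1 2PK, UK",  # Z not allowed as 2nd letter
])
print(result)
```

Building the output list in one pass keeps it the same length as the input, which the R version achieves with the separate present vector.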
Note
Please note that regular expressions only validate the possible format(s) of a string and cannot actually identify whether or not a postcode legitimately exists. For this, you should use an API. There are also some edge-cases where this regex will not properly match valid postcodes. For a list of these postcodes, please see this Wikipedia article.
The regex below additionally matches the following (make it case-insensitive to match lowercase variants as well):
British Overseas Territories
British Forces Post Office
Although they've recently changed it to align with the British postcode system to BF, followed by a number (starting with BF1), they're considered optional alternative postcodes
Special cases outlined in that article (as well as SAN TA1 - a valid postcode for Santa!)
See this regex in use here.
\b(?:(?:[A-Z][A-HJ-Y]?[0-9][0-9A-Z]?|ASCN|STHL|TDCU|BBND|[BFS]IQ{2}|GX11|PCRN|TKCA) ?[0-9][A-Z]{2}|GIR ?0A{2}|SAN ?TA1|AI-?[0-9]{4}|BFPO[ -]?[0-9]{2,3}|MSR[ -]?1(?:1[12]|[23][135])0|VG[ -]?11[1-6]0|[A-Z]{2} ? [0-9]{2}|KY[1-3][ -]?[0-2][0-9]{3})\b
I would also recommend anyone implementing this answer to read this StackOverflow question titled UK Postcode Regex (Comprehensive).
Side note
The documentation you linked to (Bulk Data Transfer: Additional Validation for CAS Upload - Section 3. UK Postcode Regular Expression) actually has an improperly written regular expression.
As mentioned in the Issues section, they should have:
Wrapped the entire expression in (?:) and placed the anchors around the non-capturing group. Their regular expression, as it stands, will fail in some cases as seen here.
The regular expression is also missing - in one of the character classes
It also made the wrong character class optional.
Here is my regular expression:
import re
txt = "0288, Bishopsgate, London Borough of Tower Hamlets, London, Greater London, England, EC2M 4QP, United Kingdom"
matches = re.findall(r'[A-Z]{1,2}[0-9][A-Z0-9]? [0-9][ABD-HJLNP-UW-Z]{2}', txt)
I'm trying to build a regular expression that accepts only numbers, including decimals, with a maximum of 3 digits to the right of the decimal point (thousandths) and 50 to the left. Valid entries would look something like these.
1
1.0
.1
1.011
.011
1202938.123
1237923782.0
So far I have ^([0-9]*|\d*\.\d{1}?\d*){1,999}$. Any help appreciated. Thanks.
I believe this should suffice:
^(?=.)\d{0,50}(?:\.\d{0,3})?$
See the regex demo. Note this will also match 1.; if this is undesired, change \d{0,3} to \d{1,3}. Similarly, this regex will match .5 (with no integer part); if you don't want this, use \d{1,50} instead of \d{0,50}.
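The pattern can be checked against the sample values from the question. This is a minimal sketch in Python (an illustrative choice; the question does not say which engine is used), and the invalid inputs are my own additions.

```python
import re

pattern = re.compile(r"^(?=.)\d{0,50}(?:\.\d{0,3})?$")

valid = ["1", "1.0", ".1", "1.011", ".011", "1202938.123", "1237923782.0"]
invalid = ["", "1.0001", "abc", "1" * 51]   # illustrative rejects

accepted = [s for s in valid if pattern.match(s)]
rejected = [s for s in invalid if not pattern.match(s)]

# Edge cases: with \d{0,3} a trailing dot is allowed, and so is a lone dot.
trailing_dot = bool(pattern.match("1."))
lone_dot = bool(pattern.match("."))

print(accepted == valid, rejected == invalid, trailing_dot, lone_dot)
```

Switching to \d{1,3} rejects both "1." and ".", since the optional group then needs at least one digit after the dot.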
You could try:
^(?=.+)\d{0,50}(?:\.\d{1,3})?$
Demonstration here at regex101.com
Explanation -
^ tells the regex that the match will begin at the start of the string,
\d{0,50} matches 0 - 50 digits,
(?=.+) is a positive look-ahead, that tells the regex that the matching should only start if the line contains some characters in it (as rightly pointed out in the comments!),
(?:\.\d{1,3})? matches an optional dot (.), followed by 1 - 3 digits,
$ tells the regex that whatever it has matched so far will be followed by the end of the string.
Another way: You can check that the string isn't empty and that the dot is always followed by digits by putting a word boundary at a strategic place:
^\d{0,50}\.?\b\d{0,3}$
As you can see, everything in the pattern is optional except the word boundary, which does the magic.
demo
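To see the word-boundary trick at work, the sketch below runs the pattern in Python (an illustrative choice of engine) over the values from the question plus a few rejects of my own.

```python
import re

pattern = re.compile(r"^\d{0,50}\.?\b\d{0,3}$")

valid = ["1", "1.0", ".1", "1.011", ".011", "1202938.123", "1237923782.0"]
invalid = ["", ".", "1.", "1.0001", "1" * 51]   # illustrative rejects

accepted = [s for s in valid if pattern.match(s)]
rejected = [s for s in invalid if not pattern.match(s)]

print(accepted == valid, rejected == invalid)
```

The empty string and a bare dot fail because \b needs a digit on one side of it, and "1." fails because there is no boundary between a dot and the end of the string.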
I'm using the ASP.NET RegularExpressionValidator
I need a regular expression to keep users who fill out a form from using all caps.
For example, if they write their name:
Bob JONES or BOB JONES or BOB JOnes or whatever, it will not match.
I am able to match all caps with this regular expression:
[A-Z]{2,10}
But the RegularExpressionValidator requires me to match valid text, not invalid text.
If your goal is to have each word have no more than 1 capital letter in a row at a time, and assuming it's okay to restrict to ASCII letters, try something like this:
^(?:[a-z]|[A-Z](?![A-Z])|['-])+$
In other words, the string must be entirely composed of either lowercase letters, or uppercase letters not followed by another uppercase letter.
This works for single words. For multiple words (like a full name, first and last), simply add a space to the alternation:
^(?:[a-z]|[A-Z](?![A-Z])|[\s'-])+$
(Edited to allow apostrophe and hyphen punctuation)
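The multi-word pattern can be tried against the names from the question. The sketch uses Python's re module rather than the ASP.NET validator (an assumption for illustration; the lookahead syntax is the same in both), and the second sample name is made up.

```python
import re

no_shouting = re.compile(r"^(?:[a-z]|[A-Z](?![A-Z])|[\s'-])+$")

samples = ["Bob Jones", "Mary-Ann O'Neil", "Bob JONES", "BOB JONES", "BOB JOnes"]

# True means the validator would accept the input.
results = {s: bool(no_shouting.match(s)) for s in samples}

for name, ok in results.items():
    print(f"{name!r}: {ok}")
```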
Use this regex: @"^[^A-Z]*$". It will match anything that does not contain uppercase characters.
Use this regular expression: ^[a-z ]+$
If you want to catch names like Bob Jones, use this one: ^([A-Z][a-z ]+)+$
Maybe I'm just stating the obvious, but couldn't you just do myVar.string.toLower before doing the Compare?
I have a regular expression
^[a-zA-Z+#-.0-9]{1,5}$
which validates that the word contains alpha-numeric characters and few special characters and length should not be more than 5 characters.
How do I make this regular expression accept a maximum of five words, each matching the above regular expression?
^[a-zA-Z+#\-.0-9]{1,5}(\s[a-zA-Z+#\-.0-9]{1,5}){0,4}$
Also, you could use for example [ ] instead of \s if you just want to accept space, not tab and newline. And you could write [ ]+ (or \s+) for any number of spaces (or whitespaces), not just one.
Edit: Removed the invalid solution and fixed the bug mentioned by unicornaddict.
I believe this may be what you're looking for. It forces at least one word of your desired pattern, then zero to four of the same, each preceded by one or more white-space characters:
^XX(\s+XX){0,4}$
where XX is your actual one-word regex.
It's separated into two distinct sections so that you're not required to have white-space at the end of the string. If you want to allow for such white-space, simply add \s* at that point. For example, allowing white-space both at start and end would be:
^\s*XX(\s+XX){0,4}\s*$
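The XX placeholder approach maps naturally onto building the pattern from a single word regex. Below is a minimal Python sketch (the names WORD and up_to_five_words are hypothetical); it uses the escaped-hyphen form of the word class.

```python
import re

# One "word": 1 to 5 of the allowed characters, with a literal hyphen.
WORD = r"[a-zA-Z+#\-.0-9]{1,5}"

# First word is mandatory, then zero to four more, each preceded by whitespace.
up_to_five_words = re.compile(rf"^{WORD}(\s+{WORD}){{0,4}}$")

samples = ["C++ C# .NET", "a b c d e", "a b c d e f", "abcdef", "a b "]
results = {s: bool(up_to_five_words.match(s)) for s in samples}

for text, ok in results.items():
    print(f"{text!r}: {ok}")
```

Keeping the word regex in one variable means a later change to the allowed characters only has to be made once.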
Your regex has a small bug. Besides letters, digits, +, # and period, it also matches every other character between # and period (such as $, % and the comma). This is because a hyphen in a char class, when surrounded on both sides, acts as a range meta char; the hyphen itself is matched only because it happens to lie inside that range. To avoid this you'll have to escape the hyphen:
^[a-zA-Z+#\-.0-9]{1,5}$
Or put it at the beginning/end of the char class, so that it's treated literally:
^[-a-zA-Z+#.0-9]{1,5}$
^[a-zA-Z+#.0-9-]{1,5}$
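The effect of the unescaped hyphen is easy to observe. This Python sketch (an illustrative choice of engine; the sample strings are made up) compares the original class with the escaped one.

```python
import re

unescaped = re.compile(r"^[a-zA-Z+#-.0-9]{1,5}$")   # "#-." is read as a range
escaped = re.compile(r"^[a-zA-Z+#\-.0-9]{1,5}$")    # hyphen is a literal

samples = ["a$b", "a,b", "a-b", "C++"]

# For each sample: (accepted by unescaped, accepted by escaped)
results = {s: (bool(unescaped.match(s)), bool(escaped.match(s))) for s in samples}

for text, (loose, strict) in results.items():
    print(f"{text!r}: unescaped={loose}, escaped={strict}")
```

Characters such as $ and the comma sit between # and the period in ASCII, which is why only the unescaped class lets them through.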
Now to match a max of 5 such words you can use:
^(?:[a-zA-Z+#\-.0-9]{1,5}\s+){1,5}$
EDIT: This solution has a severe limitation: it matches only input that ends in white space! To overcome this limitation, see the answer by Jakob.