While working in Google Analytics, I noticed the regex pattern below, which uses several regex symbols:
^(sm|social network|social media)$
Could you please help me understand what it means?
Thanks,
Aneesh
In this pattern, ^ marks the start of the line and $ marks the end of the line.
Because the pattern is anchored at both ends, whatever is written between them has to match the whole line.
The | symbol means OR.
Putting it together, the whole line must be exactly sm, or social network, or social media.
This expression matches any of the strings "sm", "social network", or "social media".
^ anchors the match to the start of the string and $ to the end, so this expression would not match "social networks".
You can use regex101 to get an explanation of heavier regexes: https://regex101.com/r/Ebup4n/1
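To see the anchoring in action outside Google Analytics, here is a minimal sketch in Python. The helper name is_exact_match and the sample strings are only illustrative assumptions; the pattern itself is the one from the question.

```python
import re

# The pattern from the question, unchanged.
pattern = re.compile(r"^(sm|social network|social media)$")

def is_exact_match(value):
    """Return True only when the whole string is one of the three alternatives."""
    return pattern.search(value) is not None

for candidate in ["sm", "social media", "social networks", "my sm"]:
    # The first two print True; the last two print False because of ^ and $.
    print(candidate, "->", is_exact_match(candidate))
```

Without ^ and $, "social networks" and "my sm" would both count as matches, since the alternatives would be allowed to match anywhere inside the string.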
I am currently trying to pull data from a pdf using the str_match function which is working well. This is an example:
values[[18]] <- str_match(Sprout_textNoLines, "Business Description: (.*?) Renter or Owned:")[,2]
Sprout_textNoLines is just a paragraph of all the characters in the pdf, not separated by lines. The particular case that I'm parsing here is
Business Description: Federal and State Construction Renter or Owned:
The str_match call that I showed earlier returns "Federal and State Construction" which is exactly what I need. However, I am finding cases where some of the pdfs are different and the inputs on the lines won't be separated by a space for example:
Business Description:Federal and State Construction Renter or Owned:
There is no space between Description: and Federal here, so the earlier call just returns NA, because the pattern Business Description: (.*?) Renter or Owned: requires a literal space after the colon. I need to automate this process, so is there a regex that could accomplish something similar to
values[[18]] <- str_match(Sprout_textNoLines, "Business Description: (.*?) Renter or Owned:")[,2]
but with adding regex to the (.*?) to account for variability in the amount of spaces between the string that I want to pull and the strings that precede and follow it?
You may use
str_match(Sprout_textNoLines, "Business Description:\\s*(.*?)\\s*Renter or Owned:")[,2]
See the regex demo
The part that is changed is \s*(.*?)\s* that matches 0 or more whitespaces (\s*), then captures any 0 or more chars other than line break chars as few as possible, and then again 0 or more whitespaces are matched.
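For comparison, the same pattern can be tried with Python's re module. This is only a sketch to show the effect of \s* on both sides of the capture group; the function name business_description is an assumption, and the sample strings are the ones from the question.

```python
import re

# Same regex as in the R call, written as a Python raw string
# (so the backslashes are not doubled).
PATTERN = re.compile(r"Business Description:\s*(.*?)\s*Renter or Owned:")

def business_description(text):
    """Return the text between the two labels, or None if the labels are absent."""
    match = PATTERN.search(text)
    return match.group(1) if match else None

with_space = "Business Description: Federal and State Construction Renter or Owned:"
without_space = "Business Description:Federal and State Construction Renter or Owned:"

print(business_description(with_space))
print(business_description(without_space))
```

Both calls return the same captured text, because the surrounding whitespace is consumed by \s* outside the group rather than by the lazy (.*?) inside it.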
I'm trying to extract UK postcodes from address strings in R, using the regular expression provided by the UK government here.
Here is my function:
address_to_postcode <- function(addresses) {
  # 1. Convert addresses to upper case
  addresses = toupper(addresses)
  # 2. Regular expression for UK postcodes:
  pcd_regex = "([Gg][Ii][Rr] 0[Aa]{2})|((([A-Za-z][0-9]{1,2})|(([A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})|(([A-Za-z][0-9][A-Za-z])|([A-Za-z][A-Ha-hJ-Yj-y][0-9]?[A-Za-z])))) {0,1}[0-9][A-Za-z]{2})"
  # 3. Check if a postcode is present in each address or not (return TRUE if present, else FALSE)
  present <- grepl(pcd_regex, addresses)
  # 4. Extract postcodes matching the regular expression for a valid UK postcode
  postcodes <- regmatches(addresses, regexpr(pcd_regex, addresses))
  # 5. Return NA where an address does not contain a (valid format) UK postcode
  postcodes_out <- list()
  postcodes_out[present] <- postcodes
  postcodes_out[!present] <- NA
  # 6. Return the results in a vector (should be same length as input vector)
  return(do.call(c, postcodes_out))
}
According to the guidance document, the logic this regular expression looks for is as follows:
"GIR 0AA" OR
One letter followed by either one or two numbers OR
One letter followed by a second letter that must be one of ABCDEFGHJKLMNOPQRSTUVWXY (i.e..not I) and then followed by either one or two numbers OR
One letter followed by one number and then another letter OR
A two part post code where the first part must be One letter followed by a second letter that must be one of ABCDEFGHJKLMNOPQRSTUVWXY (i.e..not I) and then followed by one number and optionally a further letter after that AND
The second part (separated by a space from the first part) must be One number followed by two letters.
A combination of upper and lower case characters is allowed.
Note: the length is determined by the regular expression and is between 2 and 8 characters.
My problem is that this logic is not completely preserved when using the regular expression without the ^ and $ anchors (as I have to do in this scenario because the postcode could be anywhere within the address strings); what I'm struggling with is how to preserve the order and number of characters for each segment in a partial (as opposed to complete) string match.
Consider the following example:
> address_to_postcode("1A noplace road, random city, NR1 2PK, UK")
[1] "NR1 2PK"
According to the logic in the guideline, the second letter in the postcode cannot be 'z' (and there are some other exclusions too); however look what happens when I add a 'z':
> address_to_postcode("1A noplace road, random city, NZ1 2PK, UK")
[1] "Z1 2PK"
... whereas in this case I would expect the output to be NA.
Adding the anchors (for a different usage case) doesn't seem to help as the 'z' is still accepted even though it is in the wrong place:
> grepl("^([Gg][Ii][Rr] 0[Aa]{2})|((([A-Za-z][0-9]{1,2})|(([A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})|(([A-Za-z][0-9][A-Za-z])|([A-Za-z][A-Ha-hJ-Yj-y][0-9]?[A-Za-z])))) {0,1}[0-9][A-Za-z]{2})$", "NZ1 2PK")
[1] TRUE
Two questions:
Have I misunderstood the logic of the regular expression and
If not, how can I correct it (i.e. why aren't the specified letter
and character ranges exclusive to their position within the regular expression)?
Edit
Since posting this answer, I dug deeper into the UK government's regex and found even more problems. I posted another answer here that describes all the issues and provides alternatives to their poorly formatted regex.
Note
Please note that I'm posting the raw regex here. You'll need to escape certain characters (like backslashes \) when porting to R.
Issues
You have many issues here, all of which are caused by whoever created the document you're retrieving your regex from or the coder that created it.
1. The space character
My guess is that when you copied the regular expression from the link you provided it converted the space character into a newline character and you removed it (that's exactly what I did at first). You need to, instead, change it to a space character.
^([Gg][Ii][Rr] 0[Aa]{2})|((([A-Za-z][0-9]{1,2})|(([A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})|(([AZa-z][0-9][A-Za-z])|([A-Za-z][A-Ha-hJ-Yj-y][0-9]?[A-Za-z])))) [0-9][A-Za-z]{2})$
(The restored space sits just before the final [0-9][A-Za-z]{2} at the end of the pattern.)
2. Boundaries
You need to remove the anchors ^ and $ as these indicate start and end of line. Instead, wrap your regex in (?:) and place a \b (word boundary) on either end as the following shows. In fact, the regex in the documentation is incorrect (see Side note for more information) as it will fail to anchor the pattern properly.
See regex in use here
\b(?:([Gg][Ii][Rr] 0[Aa]{2})|((([A-Za-z][0-9]{1,2})|(([A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})|(([AZa-z][0-9][A-Za-z])|([A-Za-z][A-Ha-hJ-Yj-y][0-9]?[A-Za-z])))) [0-9][A-Za-z]{2}))\b
(The additions are \b(?: at the start and )\b at the end.)
3. Character class oversight
There's a missing - in the character class as pointed out by @deadcrab in his answer here.
\b(?:([Gg][Ii][Rr] 0[Aa]{2})|((([A-Za-z][0-9]{1,2})|(([A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})|(([A-Za-z][0-9][A-Za-z])|([A-Za-z][A-Ha-hJ-Yj-y][0-9]?[A-Za-z])))) [0-9][A-Za-z]{2}))\b
(The fix is in the third alternative, where [AZa-z] becomes [A-Za-z].)
4. They made the wrong character class optional!
In the documentation it clearly states:
A two part post code where the first part must be:
One letter followed by a second letter that must be one of ABCDEFGHJKLMNOPQRSTUVWXY (i.e..not I) and then followed by one number and optionally a further letter after that
They made the wrong character class optional!
\b(?:([Gg][Ii][Rr] 0[Aa]{2})|((([A-Za-z][0-9]{1,2})|(([A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})|(([A-Za-z][0-9][A-Za-z])|([A-Za-z][A-Ha-hJ-Yj-y][0-9]?[A-Za-z])))) [0-9][A-Za-z]{2}))\b
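(In the last alternative, [A-Za-z][A-Ha-hJ-Yj-y][0-9]?[A-Za-z], the class marked optional is [0-9]?. It should be the trailing [A-Za-z] instead.)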
5. The whole thing is just awful...
There are so many things wrong with this regex that I just decided to rewrite it. It can very easily be simplified to perform a fraction of the steps it currently takes to match text.
\b(?:[A-Za-z][A-HJ-Ya-hj-y]?[0-9][0-9A-Za-z]? [0-9][A-Za-z]{2}|[Gg][Ii][Rr] 0[Aa]{2})\b
Answer
As mentioned in the comments below my answer, some postcodes are missing the space character. For missing spaces in the postcodes (e.g. NR12PK), simply add a ? after the spaces as shown in the regex below:
\b(?:[A-Za-z][A-HJ-Ya-hj-y]?[0-9][0-9A-Za-z]? ?[0-9][A-Za-z]{2}|[Gg][Ii][Rr] ?0[Aa]{2})\b
(The two additions are the ? placed after each space.)
You may also shorten the regex above with the following and use the case-insensitive flag (ignore.case(pattern) or ignore_case = TRUE in R, depending on the method used):
\b(?:[A-Z][A-HJ-Y]?[0-9][0-9A-Z]? ?[0-9][A-Z]{2}|GIR ?0A{2})\b
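As a rough check of the shortened regex against the examples from the question, here is a sketch in Python rather than R. The function name address_to_postcode is borrowed from the question, but this port, its use of None in place of NA, and the third sample address are assumptions made for illustration.

```python
import re

# Shortened regex from above, with the case-insensitive flag applied.
POSTCODE = re.compile(
    r"\b(?:[A-Z][A-HJ-Y]?[0-9][0-9A-Z]? ?[0-9][A-Z]{2}|GIR ?0A{2})\b",
    re.IGNORECASE,
)

def address_to_postcode(addresses):
    """Return the first postcode-shaped match per address, or None if there is none."""
    results = []
    for address in addresses:
        match = POSTCODE.search(address)
        results.append(match.group(0).upper() if match else None)
    return results

sample = [
    "1A noplace road, random city, NR1 2PK, UK",
    "1A noplace road, random city, NZ1 2PK, UK",
    "1A noplace road, random city, NR12PK, UK",
]
print(address_to_postcode(sample))
```

The second address yields None: the word boundary stops the match from starting at the Z in the middle of NZ1, which is what allowed "Z1 2PK" to slip through in the original version.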
Note
Please note that regular expressions only validate the possible format(s) of a string and cannot actually identify whether or not a postcode legitimately exists. For this, you should use an API. There are also some edge-cases where this regex will not properly match valid postcodes. For a list of these postcodes, please see this Wikipedia article.
The regex below additionally matches the following (make it case-insensitive to match lowercase variants as well):
British Overseas Territories
British Forces Post Office
Although these were recently changed to align with the British postcode system (BF followed by a number, starting with BF1), the new codes are considered optional alternative postcodes
Special cases outlined in that article (as well as SAN TA1 - a valid postcode for Santa!)
See this regex in use here.
\b(?:(?:[A-Z][A-HJ-Y]?[0-9][0-9A-Z]?|ASCN|STHL|TDCU|BBND|[BFS]IQ{2}|GX11|PCRN|TKCA) ?[0-9][A-Z]{2}|GIR ?0A{2}|SAN ?TA1|AI-?[0-9]{4}|BFPO[ -]?[0-9]{2,3}|MSR[ -]?1(?:1[12]|[23][135])0|VG[ -]?11[1-6]0|[A-Z]{2} ? [0-9]{2}|KY[1-3][ -]?[0-2][0-9]{3})\b
I would also recommend anyone implementing this answer to read this StackOverflow question titled UK Postcode Regex (Comprehensive).
Side note
The documentation you linked to (Bulk Data Transfer: Additional Validation for CAS Upload - Section 3. UK Postcode Regular Expression) actually has an improperly written regular expression.
As mentioned in the Issues section, they should have:
Wrapped the entire expression in (?:) and placed the anchors around the non-capturing group. Their regular expression, as it stands, will fail in some cases as seen here.
The regular expression is also missing - in one of the character classes
It also made the wrong character class optional.
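The anchoring problem comes from operator precedence: | binds more loosely than ^ and $, so each anchor only applies to the alternative it touches. A small sketch in Python with toy words (cat and dog are placeholders, not part of the postcode regex) shows the effect:

```python
import re

# Parsed as (^cat) OR (dog$): each anchor guards only one alternative.
loose = re.compile(r"^cat|dog$")
# The non-capturing group makes both anchors apply to both alternatives.
strict = re.compile(r"^(?:cat|dog)$")

for word in ["cat", "dog", "catalog", "hotdog"]:
    print(word, bool(loose.search(word)), bool(strict.search(word)))
```

The loose pattern accepts "catalog" and "hotdog", while the strict one accepts only "cat" and "dog". The same thing happens with the documented postcode regex: only its last alternative is tied to $, so "NZ1 2PK" is accepted because "Z1 2PK" matches at the end of the string.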
Here is my regular expression:
import re
txt = "0288, Bishopsgate, London Borough of Tower Hamlets, London, Greater London, England, EC2M 4QP, United Kingdom"
matches = re.findall(r'[A-Z]{1,2}[0-9][A-Z0-9]? [0-9][ABD-HJLNP-UW-Z]{2}', txt)
How would I go about building a regex that allows only digits, with no spaces, and an optional "+" at the beginning?
Try this:
^\+?\d+$
^ anchors it to the start of the string, $ to the end
\+? is the optional +
\d is a digit and the following + is the quantifier that says at least one (digit).
A useful resource to learn regular expressions is the tutorial of regular-expressions.info
And Regexr is a very useful resource to test regular expressions, see this regex here online
This one should work: ^\+?\d+$
You need to match an optional +, followed by digits. The + is a special character, so you need to escape it. To match a telephone number on its own (nothing else in the string) use ^\+?\d+$; to match it inside a larger string, omit the ^ and $ and use just \+?\d+. You can also change \d+ to \d{7} if you know how many digits there should be.
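Here is a minimal sketch of that check in Python. Two details are choices made for this example, not something the answers above prescribe: fullmatch is used instead of ^ and $ (in Python, $ also matches just before a trailing newline), and re.ASCII restricts \d to 0-9. The function name is_phone_number is hypothetical.

```python
import re

# Optional leading +, then one or more ASCII digits.
PHONE = re.compile(r"\+?\d+", re.ASCII)

def is_phone_number(value):
    """Return True if the whole string is an optional + followed by digits only."""
    return PHONE.fullmatch(value) is not None

for candidate in ["+441234567890", "441234567890", "44 1234", "12+34", "+"]:
    print(repr(candidate), is_phone_number(candidate))
```

Only the first two candidates pass: a space, a + that is not at the start, or a + with no digits after it all cause the match to fail.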
I'm using the following:
(^\+?[0-9]{10,15})$
The + in the beginning is optional as indicated above, with added length restrictions (being minimum 10 digits & maximum 15)
I need to find the regex for []
For example, if the string is: Hi [Stack], Here is my [Tag] which i need to [Find].
It should return
Stack, Tag, Find
Pretty simple, you just need to (1) escape the brackets with backslashes, and (2) use (.*?) to capture the contents.
\[(.*?)\]
The parentheses are a capturing group, they capture their contents for later use. The question mark after .* makes the matching non-greedy. This means it will match the shortest match possible, rather than the longest one. The difference between greedy and non-greedy comes up when you have multiple matches in a line:
Hi [Stack], Here is my [Tag] which i need to [Find].
   ^______________________________________________^
A greedy match will find the longest string possible between two sets of square brackets. That's not right. A non-greedy match will find the shortest:
Hi [Stack], Here is my [Tag] which i need to [Find].
   ^_____^
Anyways, the code will end up looking like:
string regex = @"\[(.*?)\]";
string text = "Hi [Stack], Here is my [Tag] which i need to [Find].";
foreach (Match match in Regex.Matches(text, regex))
{
    // Groups[1] holds the text captured between the brackets
    Console.WriteLine("Found {0}", match.Groups[1].Value);
}
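The greedy versus non-greedy difference described above can also be checked quickly in Python. This is just a side-by-side sketch using the sample sentence from the question; the variable names are arbitrary.

```python
import re

text = "Hi [Stack], Here is my [Tag] which i need to [Find]."

# Non-greedy: each match stops at the first closing bracket it can reach.
lazy = re.findall(r"\[(.*?)\]", text)
# Greedy: a single match runs from the first [ to the last ].
greedy = re.findall(r"\[(.*)\]", text)

print(lazy)    # ['Stack', 'Tag', 'Find']
print(greedy)  # ['Stack], Here is my [Tag] which i need to [Find']
```

findall returns only the contents of the capturing group, so the brackets themselves are not included in either result.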
\[([\w]+?)\]
should work. You might have to change the matching group if you need to include special chars as well.
Depending on what environment you mean:
\[([^\]]+)]
.NET syntax, taking care of multiple embedded brackets:
\[ ( (?: \\. | (?<OPEN> \[) | (?<-OPEN> \]) | [^\]] )*? (?(OPEN)(?!)) ) \]
This counts the number of opened [ sections in OPEN and only succeeds if OPEN is 0 in the end.
I encountered a similar issue and discovered that this also does the trick.
\[\w{1,}\]
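\w is a metacharacter that matches a single word character (a letter, digit, or underscore), so \w{1,} matches one or more word characters.
The {X,} quantifier matches X or more repetitions of the preceding token. With the second number left out on purpose, {1,} means one or more characters, the same as +. Note that this pattern has no capturing group, so each match includes the surrounding brackets.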
I'm trying to create a validation expression that checks the length of an input and allows text and punctuation marks (e.g. , ? ; : ! " £ $ % )
What I have come up with so far is "^\s*(\w\s*){1,2046}\s*$" but this won't allow any punctuation marks. To be honest I'm pretty sketchy in this area so any help would be greatly appreciated!
Thanks,
Steve
^[\w\s.,:;!?€¥£¢$-]{0,2048}$
^ -- Beginning of string/line
[] -- A character class
\w -- A word character
\s -- A whitespace character (space, tab, newline)
.,:;!?€¥£¢$- -- Punctuation and special characters
{} -- Number of repeats (min,max)
$ -- End of string/line
If you're looking to allow text and punctuation what are you looking to exclude? Digits?
\D will give you everything that isn't a digit
You may already know this, but: guarding against malicious input should be handled server side, not in form validation on the client side. Black hats won't bat an eye at bypassing your script.
I think with most popular web front end frameworks there is library code for scrubbing input. A short regex alone is fairly flimsy for guarding against a SQL injection attack.
This should do it:
^\s*([\w,\?;:!"£$%]\s*){1,2046}$
Note that this doesn't limit the length of the input at all, it only limits the number of non-white-space characters.
To limit the length, you can use a positive lookahead that only matches a specific length range:
^(?=.{1,2046}$)\s*([\w,\?;:!"£$%]\s*)+$
(The upper limit on the number of non-white-space characters is pointless if it's the same as the length. The + is short for {1,}, requiring at least one non-white-space character.)
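A sketch of the lookahead version in Python, to show how the length check and the character check work together. The function name is_valid_text and the max_len parameter are assumptions for this example; the character class is the one from the answer above, with the unnecessary escape on ? dropped and the group made non-capturing.

```python
import re

def is_valid_text(value, max_len=2046):
    """Check total length via a lookahead and allowed characters via the main pattern."""
    pattern = (
        r"^(?=.{1," + str(max_len) + r"}$)"  # total length between 1 and max_len
        r'\s*(?:[\w,?;:!"£$%]\s*)+$'         # at least one allowed non-space character
    )
    return re.match(pattern, value) is not None

print(is_valid_text("Hello, world!"))        # True
print(is_valid_text("   "))                  # False: whitespace only
print(is_valid_text("a<b"))                  # False: < is not in the allowed set
print(is_valid_text("abc def", max_len=5))   # False: 7 characters is too long
```

Because . does not match a newline by default, multi-line input fails the lookahead; passing re.DOTALL would be one way to allow it if that is wanted.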
This regular expression should match all your characters and limit the input:
^\s*([\w\s\?\;\:\!\"£\$%]{1,2046})\s*$