Extract mm/dd/yyyy and m/dd/yyyy dates from string in R [duplicate] - r

My regex pattern looks something like
<xxxx location="file path/level1/level2" xxxx some="xxx">
I am only interested in the part in quotes assigned to location. Shouldn't it be as easy as below without the greedy switch?
/.*location="(.*)".*/
Does not seem to work.

You need to make your regular expression lazy/non-greedy, because by default, "(.*)" will match all of "file path/level1/level2" xxx some="xxx".
Instead you can make your dot-star non-greedy, which will make it match as few characters as possible:
/location="(.*?)"/
Adding a ? on a quantifier (?, * or +) makes it non-greedy.
Note: this is only available in regex engines which implement the Perl 5 extensions (Java, Ruby, Python, etc) but not in "traditional" regex engines (including Awk, sed, grep without -P, etc.).

location="(.*)" will match from the " after location= until the " after some="xxx unless you make it non-greedy.
So you either need .*? (i.e. make it non-greedy by adding ?) or better replace .* with [^"]*.
[^"] Matches any character except for a " <quotation-mark>
More generic: [^abc] - Matches any character except for an a, b or c

How about
.*location="([^"]*)".*
This avoids the unlimited search with .* and will match exactly to the first quote.

Use non-greedy matching, if your engine supports it. Add the ? inside the capture.
/location="(.*?)"/

Use of Lazy quantifiers ? with no global flag is the answer.
Eg,
If you had global flag /g then, it would have matched all the lowest length matches as below.

Here's another way.
Here's the one you want. This is lazy [\s\S]*?
The first item:
[\s\S]*?(?:location="[^"]*")[\s\S]* Replace with: $1
Explaination: https://regex101.com/r/ZcqcUm/2
For completeness, this gets the last one. This is greedy [\s\S]*
The last item:[\s\S]*(?:location="([^"]*)")[\s\S]*
Replace with: $1
Explaination: https://regex101.com/r/LXSPDp/3
There's only 1 difference between these two regular expressions and that is the ?

The other answers here fail to spell out a full solution for regex versions which don't support non-greedy matching. The greedy quantifiers (.*?, .+? etc) are a Perl 5 extension which isn't supported in traditional regular expressions.
If your stopping condition is a single character, the solution is easy; instead of
a(.*?)b
you can match
a[^ab]*b
i.e specify a character class which excludes the starting and ending delimiiters.
In the more general case, you can painstakingly construct an expression like
start(|[^e]|e(|[^n]|n(|[^d])))end
to capture a match between start and the first occurrence of end. Notice how the subexpression with nested parentheses spells out a number of alternatives which between them allow e only if it isn't followed by nd and so forth, and also take care to cover the empty string as one alternative which doesn't match whatever is disallowed at that particular point.
Of course, the correct approach in most cases is to use a proper parser for the format you are trying to parse, but sometimes, maybe one isn't available, or maybe the specialized tool you are using is insisting on a regular expression and nothing else.

Because you are using quantified subpattern and as descried in Perl Doc,
By default, a quantified subpattern is "greedy", that is, it will
match as many times as possible (given a particular starting location)
while still allowing the rest of the pattern to match. If you want it
to match the minimum number of times possible, follow the quantifier
with a "?" . Note that the meanings don't change, just the
"greediness":
*? //Match 0 or more times, not greedily (minimum matches)
+? //Match 1 or more times, not greedily
Thus, to allow your quantified pattern to make minimum match, follow it by ? :
/location="(.*?)"/

import regex
text = 'ask her to call Mary back when she comes back'
p = r'(?i)(?s)call(.*?)back'
for match in regex.finditer(p, str(text)):
print (match.group(1))
Output:
Mary

Related

Extract a certain element from URL using regular expressions

I need to extract the first element ("adidas-originals") after "designer" in the following URL using regular expressions.
xxx/en-ca/men/designers/adidas-originals/shorts
This needs to be done in Google Big Query API (standard SQL). To this end, I have tried several ways to get the desired valued without any success. Below is the best solution that I have found so far which obviously is not the right one as it returns "/adidas-originals/shorts".
REGEXP_EXTRACT(hits.page.pagePath, r'designers([^\n]*)')
Thanks!
The [^\n]* matches 0 or more chars other than a newline, LF, so no wonder it matches too much.
You need a pattern to match up to the next /, so you may use
designers/([^/]+)
Or a more precise:
(?:^|/)designers/([^/]+)
See the regex demo
Details
(?:^|/) - either start of a string or / (you may just use / if designers is always preceded with /)
designers/ a designers/ substring
([^/]+) - Capturing group 1 (just what will be returned with the REGEXP_EXTRACT function): one or more chars other than /.

sub command to extract data and split data frame column [duplicate]

Simple regex question. I have a string on the following format:
this is a [sample] string with [some] special words. [another one]
What is the regular expression to extract the words within the square brackets, ie.
sample
some
another one
Note: In my use case, brackets cannot be nested.
You can use the following regex globally:
\[(.*?)\]
Explanation:
\[ : [ is a meta char and needs to be escaped if you want to match it literally.
(.*?) : match everything in a non-greedy way and capture it.
\] : ] is a meta char and needs to be escaped if you want to match it literally.
(?<=\[).+?(?=\])
Will capture content without brackets
(?<=\[) - positive lookbehind for [
.*? - non greedy match for the content
(?=\]) - positive lookahead for ]
EDIT: for nested brackets the below regex should work:
(\[(?:\[??[^\[]*?\]))
This should work out ok:
\[([^]]+)\]
Can brackets be nested?
If not: \[([^]]+)\] matches one item, including square brackets. Backreference \1 will contain the item to be match. If your regex flavor supports lookaround, use
(?<=\[)[^]]+(?=\])
This will only match the item inside brackets.
To match a substring between the first [ and last ], you may use
\[.*\] # Including open/close brackets
\[(.*)\] # Excluding open/close brackets (using a capturing group)
(?<=\[).*(?=\]) # Excluding open/close brackets (using lookarounds)
See a regex demo and a regex demo #2.
Use the following expressions to match strings between the closest square brackets:
Including the brackets:
\[[^][]*] - PCRE, Python re/regex, .NET, Golang, POSIX (grep, sed, bash)
\[[^\][]*] - ECMAScript (JavaScript, C++ std::regex, VBA RegExp)
\[[^\]\[]*] - Java, ICU regex
\[[^\]\[]*\] - Onigmo (Ruby, requires escaping of brackets everywhere)
Excluding the brackets:
(?<=\[)[^][]*(?=]) - PCRE, Python re/regex, .NET (C#, etc.), JGSoft Software
\[([^][]*)] - Bash, Golang - capture the contents between the square brackets with a pair of unescaped parentheses, also see below
\[([^\][]*)] - JavaScript, C++ std::regex, VBA RegExp
(?<=\[)[^\]\[]*(?=]) - Java regex, ICU (R stringr)
(?<=\[)[^\]\[]*(?=\]) - Onigmo (Ruby, requires escaping of brackets everywhere)
NOTE: * matches 0 or more characters, use + to match 1 or more to avoid empty string matches in the resulting list/array.
Whenever both lookaround support is available, the above solutions rely on them to exclude the leading/trailing open/close bracket. Otherwise, rely on capturing groups (links to most common solutions in some languages have been provided).
If you need to match nested parentheses, you may see the solutions in the Regular expression to match balanced parentheses thread and replace the round brackets with the square ones to get the necessary functionality. You should use capturing groups to access the contents with open/close bracket excluded:
\[((?:[^][]++|(?R))*)] - PHP PCRE
\[((?>[^][]+|(?<o>)\[|(?<-o>]))*)] - .NET demo
\[(?:[^\]\[]++|(\g<0>))*\] - Onigmo (Ruby) demo
If you do not want to include the brackets in the match, here's the regex: (?<=\[).*?(?=\])
Let's break it down
The . matches any character except for line terminators. The ?= is a positive lookahead. A positive lookahead finds a string when a certain string comes after it. The ?<= is a positive lookbehind. A positive lookbehind finds a string when a certain string precedes it. To quote this,
Look ahead positive (?=)
Find expression A where expression B follows:
A(?=B)
Look behind positive (?<=)
Find expression A where expression B
precedes:
(?<=B)A
The Alternative
If your regex engine does not support lookaheads and lookbehinds, then you can use the regex \[(.*?)\] to capture the innards of the brackets in a group and then you can manipulate the group as necessary.
How does this regex work?
The parentheses capture the characters in a group. The .*? gets all of the characters between the brackets (except for line terminators, unless you have the s flag enabled) in a way that is not greedy.
Just in case, you might have had unbalanced brackets, you can likely design some expression with recursion similar to,
\[(([^\]\[]+)|(?R))*+\]
which of course, it would relate to the language or RegEx engine that you might be using.
RegEx Demo 1
Other than that,
\[([^\]\[\r\n]*)\]
RegEx Demo 2
or,
(?<=\[)[^\]\[\r\n]*(?=\])
RegEx Demo 3
are good options to explore.
If you wish to simplify/modify/explore the expression, it's been explained on the top right panel of regex101.com. If you'd like, you can also watch in this link, how it would match against some sample inputs.
RegEx Circuit
jex.im visualizes regular expressions:
Test
const regex = /\[([^\]\[\r\n]*)\]/gm;
const str = `This is a [sample] string with [some] special words. [another one]
This is a [sample string with [some special words. [another one
This is a [sample[sample]] string with [[some][some]] special words. [[another one]]`;
let m;
while ((m = regex.exec(str)) !== null) {
// This is necessary to avoid infinite loops with zero-width matches
if (m.index === regex.lastIndex) {
regex.lastIndex++;
}
// The result can be accessed through the `m`-variable.
m.forEach((match, groupIndex) => {
console.log(`Found match, group ${groupIndex}: ${match}`);
});
}
Source
Regular expression to match balanced parentheses
(?<=\[).*?(?=\]) works good as per explanation given above. Here's a Python example:
import re
str = "Pagination.go('formPagination_bottom',2,'Page',true,'1',null,'2013')"
re.search('(?<=\[).*?(?=\])', str).group()
"'formPagination_bottom',2,'Page',true,'1',null,'2013'"
The #Tim Pietzcker's answer here
(?<=\[)[^]]+(?=\])
is almost the one I've been looking for. But there is one issue that some legacy browsers can fail on positive lookbehind.
So I had to made my day by myself :). I manged to write this:
/([^[]+(?=]))/g
Maybe it will help someone.
console.log("this is a [sample] string with [some] special words. [another one]".match(/([^[]+(?=]))/g));
if you want fillter only small alphabet letter between square bracket a-z
(\[[a-z]*\])
if you want small and caps letter a-zA-Z
(\[[a-zA-Z]*\])
if you want small caps and number letter a-zA-Z0-9
(\[[a-zA-Z0-9]*\])
if you want everything between square bracket
if you want text , number and symbols
(\[.*\])
This code will extract the content between square brackets and parentheses
(?:(?<=\().+?(?=\))|(?<=\[).+?(?=\]))
(?: non capturing group
(?<=\().+?(?=\)) positive lookbehind and lookahead to extract the text between parentheses
| or
(?<=\[).+?(?=\]) positive lookbehind and lookahead to extract the text between square brackets
In R, try:
x <- 'foo[bar]baz'
str_replace(x, ".*?\\[(.*?)\\].*", "\\1")
[1] "bar"
([[][a-z \s]+[]])
Above should work given the following explaination
characters within square brackets[] defines characte class which means pattern should match atleast one charcater mentioned within square brackets
\s specifies a space
 + means atleast one of the character mentioned previously to +.
I needed including newlines and including the brackets
\[[\s\S]+\]
If someone wants to match and select a string containing one or more dots inside square brackets like "[fu.bar]" use the following:
(?<=\[)(\w+\.\w+.*?)(?=\])
Regex Tester

How to write a regex OR statement within strapply in R

I have been using strapplyc in R to select different portions of a string that match one particular set of criteria. These have worked successfully until I found a portion of the string where the required portion could be defined one of two ways.
Here is an example of the string which is liberally sprinkled with \t:
\t\t\tsome words here\t\t\tDefect: some more words here Action: more words
I can write the strapply statement to capture the text between Defect: and the start of Action:
strapplyc(record[i], "Defect:(.*?)Action")
This works and selects the chosen text between Defect: and Action. In some cases there is no action section to the string and I've used the following code to capture these cases.
strapplyc(record[i], "Defect:(.*?)$")
What I have been trying to do is capture the text that either ends with Action, or with the end of the string (using $).
This is the bit that keeps failing. It returns nothing for either option. Here is my failing code:
strapplyc(record[i], "Defect:(.*?)Action|$")
Any idea where I'm going wrong, or a better solution would be much appreciated.
If you are up for a more efficient solution, you could drop the .*? matching and unroll your pattern like:
Defect:((?:[^A]+|A(?!ction))*)
This matches Defect: followed by any amount of characters that are not an A or are an A and not followed by ction. This avoids the expanding that is needed for the lazy dot matching. It will work for both ways, as it does stop matching when it hits Action or the end of your string.
As suggested by Wiktor, you can also use
Defect:([^A]*(?:A(?!ction)[^A]*)*)
Which is a little bit faster when there are many As in the string.
You might want to consider to use A(?!ction:) or A(?!ction\s*:), to avoid false early matches.
The alternation operator | is the regex operator with the lowest precedence. That means the regex Defect:(.*?)Action|$ is actually a combination of Defect:(.*?)Action and $ - since an empty string is a valid match for $, your regex returns the empty string.
To solve that, you should combine the regexes Defect:(.*?)Action and Defect:(.*?)$ with an OR:
Defect:(.*?)Action|Defect:(.*?)$
Or you can enclose Action|$ in a group as Sebastian Proske said in the comments:
Defect:(.*?)(?:Action|$)

ASP.NET regular expression to restrict consecutive characters

Using ASP.NET syntax for the RegularExpressionValidator control, how do you specify restriction of two consecutive characters, say character 'x'?
You can provide a regex like the following:
(\\w)\\1+
(\\w) will match any word character, and \\1+ will match whatever character was matched with (\\w).
I do not have access to asp.net at the moment, but take this console app as an example:
Console.WriteLine(regex.IsMatch("hello") ? "Not valid" : "Valid"); // Hello contains to consecutive l:s, hence not valid
Console.WriteLine(regex.IsMatch("Bar") ? "Not valid" : "Valid"); // Bar does not contain any consecutive characters, so it's valid
Alexn is right, this is the way you match consecutive characters with a regex, i.e. (a)\1 matches aa.
However, I think this is a case of everything looking like a nail when you're holding a hammer. I would not use regex to validate this input. Rather, I suggest validating this in code (just looping through the string, comparing str[i] and str[i-1], checking for this condition).
This should work:
^((?<char>\w)(?!\k<char>))*$
It matches abc, but not abbc.
The key is to use so called "zero-width negative lookahead assertion" (syntax: (?! subexpression)).
Here we make sure that a group matched with (?<char>\w) is not followed by itself (expressed with (?!\k<char>)).
Note that \w can be replaced with any valid set of characters (\w does not match white-spaces characters).
You can also do it without named group (note that the referenced group has number 2):
^((\w)(?!\2))*$
And its important to start with ^ and end with $ to match the whole text.
If you want to only exclude text with consecutive x characters, you may use this
^((?<char>x)(?!\k<char>)|[^x\W])*$
or without backreferences
^(x(?!x)|[^x\W])*$
All syntax elements for .NET Framework Regular Expressions are explained here.
You can use a regex to validate what's wrong as well as what's right of course. The regex (.)\1 will match any two consecutive characters, so you can just reject any input that gives an IsValid result to that. If this is the only validation you need, I think this way is far easier than trying to come up with a regex to validate correct input instead.

Regex for anything between []

I need to find the regex for []
For eg, if the string is - Hi [Stack], Here is my [Tag] which i need to [Find].
It should return
Stack, Tag, Find
Pretty simple, you just need to (1) escape the brackets with backslashes, and (2) use (.*?) to capture the contents.
\[(.*?)\]
The parentheses are a capturing group, they capture their contents for later use. The question mark after .* makes the matching non-greedy. This means it will match the shortest match possible, rather than the longest one. The difference between greedy and non-greedy comes up when you have multiple matches in a line:
Hi [Stack], Here is my [Tag] which i need to [Find].
^______________________________________________^
A greedy match will find the longest string possible between two sets of square brackets. That's not right. A non-greedy match will find the shortest:
Hi [Stack], Here is my [Tag] which i need to [Find].
^_____^
Anyways, the code will end up looking like:
string regex = #"\[(.*?)\]";
string text = "Hi [Stack], Here is my [Tag] which i need to [Find].";
foreach (Match match in Regex.Matches(text, regex))
{
Console.WriteLine("Found {0}", match.Groups[1].Value);
}
\[([\w]+?)\]
should work. You might have to change the matching group if you need to include special chars as well.
Depending on what environment you mean:
\[([^\]]+)]
.NET syntax, taking care of multiple embedded brackets:
\[ ( (?: \\. | (?<OPEN> \[) | (?<-OPEN> \]) | [^\]] )*? (?(OPEN)(?!)) ) \]
This counts the number of opened [ sections in OPEN and only succeeds if OPEN is 0 in the end.
I encountered a similar issue and discovered that this also does the trick.
\[\w{1,}\]
The \w means Metacharacter. This will match 1 or more word characters.
Using n{X,} quantifier matches any string where you can obtain different amounts. With the second number left out on purpose, the expression means 1 or more characters to match.

Resources