Number after captured group in regex - r

I want to write a simple regex to add leading zeros to numbers in my R code. The simplest way is to find (\s)\.(\d) and replace it with \10.\2. But that doesn't work in RStudio, which apparently reads \10 as the 10th captured group rather than the 1st group followed by a literal 0. According to this question, RStudio uses PCRE, but none of the methods for PCRE (or any other engine) described here work in RStudio's find & replace feature. Is it possible to put a number after a captured group without leaving RStudio?

As a work-around, you can use lookarounds here:
Search for: (?<=\s)\.(?=\d)
Replace with: 0.
See the regex demo.
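The same lookaround idea can be tried outside RStudio. Below is a minimal Python sketch, not the RStudio find & replace feature itself; the names LEADING_DOT and add_leading_zero and the sample input are assumptions made for illustration. Because the lookarounds consume nothing, the replacement is just the literal 0. and no group reference is followed by a digit.

```python
import re

# Lookarounds assert the context without consuming it, so the replacement
# string never needs a group reference followed by a digit.
LEADING_DOT = re.compile(r"(?<=\s)\.(?=\d)")


def add_leading_zero(code):
    """Insert a 0 before any '.' that follows whitespace and precedes a digit."""
    return LEADING_DOT.sub("0.", code)


# Hypothetical sample line of R code.
print(add_leading_zero("x <- c( .5, 1.25, .75)"))
```

Note that 1.25 is left alone, because its dot is preceded by a digit rather than by whitespace.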

Related

Extract mm/dd/yyyy and m/dd/yyyy dates from string in R [duplicate]

My regex pattern looks something like
<xxxx location="file path/level1/level2" xxxx some="xxx">
I am only interested in the part in quotes assigned to location. Shouldn't it be as easy as below without the greedy switch?
/.*location="(.*)".*/
Does not seem to work.
You need to make your regular expression lazy/non-greedy, because by default "(.*)" is greedy and will match everything from file path/level1/level2 through to the last quote on the line, i.e. file path/level1/level2" xxxx some="xxx.
Instead you can make your dot-star non-greedy, which will make it match as few characters as possible:
/location="(.*?)"/
Adding a ? on a quantifier (?, * or +) makes it non-greedy.
Note: this is only available in regex engines which implement the Perl 5 extensions (Java, Ruby, Python, etc) but not in "traditional" regex engines (including Awk, sed, grep without -P, etc.).
location="(.*)" will match from the " after location= until the " after some="xxx unless you make it non-greedy.
So you either need .*? (i.e. make it non-greedy by adding ?) or better replace .* with [^"]*.
[^"] Matches any character except for a " <quotation-mark>
More generic: [^abc] - Matches any character except for an a, b or c
How about
.*location="([^"]*)".*
This avoids the unlimited search with .* and will match exactly to the first quote.
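The three variants discussed so far (greedy, lazy, and negated character class) can be compared side by side. This is a small Python sketch using the tag from the question; the helper name extract_location is an assumption for illustration.

```python
import re

tag = '<xxxx location="file path/level1/level2" xxxx some="xxx">'


def extract_location(pattern, text):
    """Return capture group 1 of the first match, or None if nothing matches."""
    m = re.search(pattern, text)
    return m.group(1) if m else None


greedy = extract_location(r'location="(.*)"', tag)       # runs on to the last quote
lazy = extract_location(r'location="(.*?)"', tag)        # stops at the first quote
negated = extract_location(r'location="([^"]*)"', tag)   # cannot cross a quote at all

print(greedy)
print(lazy)
print(negated)
```

The lazy and negated-class versions agree here; the negated class also works in engines that lack lazy quantifiers.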
Use non-greedy matching, if your engine supports it. Add the ? inside the capture.
/location="(.*?)"/
Using the lazy quantifier ? without the global flag is the answer.
With the global flag /g, it would instead have returned every shortest-length match in the input.
Here's another way.
Here's the one you want. This is lazy: [\s\S]*?
The first item:
[\s\S]*?(?:location="([^"]*)")[\s\S]*
Replace with: $1
Explanation: https://regex101.com/r/ZcqcUm/2
For completeness, this gets the last one. This is greedy: [\s\S]*
The last item:
[\s\S]*(?:location="([^"]*)")[\s\S]*
Replace with: $1
Explanation: https://regex101.com/r/LXSPDp/3
There's only 1 difference between these two regular expressions and that is the ?
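To see the effect of that single ?, here is a hedged Python sketch. It assumes a capture group around [^"]* in both patterns and uses a made-up input with three location attributes; Python writes the replacement as \1 where the answer writes $1.

```python
import re

# Hypothetical input containing several location attributes.
text = '<a location="first/path"> <b location="second/path"> <c location="last/path">'

FIRST = r'[\s\S]*?(?:location="([^"]*)")[\s\S]*'  # lazy prefix: stops at the first attribute
LAST = r'[\s\S]*(?:location="([^"]*)")[\s\S]*'    # greedy prefix: backtracks to the last one

first = re.sub(FIRST, r"\1", text)
last = re.sub(LAST, r"\1", text)

print(first)
print(last)
```

Both patterns consume the whole input, so the substitution leaves only the captured path.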
The other answers here fail to spell out a full solution for regex versions which don't support non-greedy matching. The non-greedy quantifiers (.*?, .+? etc) are a Perl 5 extension which isn't supported in traditional regular expressions.
If your stopping condition is a single character, the solution is easy; instead of
a(.*?)b
you can match
a[^ab]*b
i.e. specify a character class which excludes the starting and ending delimiters.
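A quick way to convince yourself of the single-character case is to run both forms on the same input. This is a Python sketch on a made-up string; the capture group in the character-class version is added here only so the two results can be compared.

```python
import re

text = "xa123b456b"  # hypothetical input with two possible closing delimiters

greedy = re.search(r"a(.*)b", text).group(1)        # runs to the last b
lazy = re.search(r"a(.*?)b", text).group(1)         # stops at the first b
negated = re.search(r"a([^ab]*)b", text).group(1)   # same result without a lazy quantifier

print(greedy, lazy, negated)
```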
In the more general case, you can painstakingly construct an expression like
start(|[^e]|e(|[^n]|n(|[^d])))end
to capture a match between start and the first occurrence of end. Notice how the subexpression with nested parentheses spells out a number of alternatives which between them allow e only if it isn't followed by nd and so forth, and also take care to cover the empty string as one alternative which doesn't match whatever is disallowed at that particular point.
Of course, the correct approach in most cases is to use a proper parser for the format you are trying to parse, but sometimes, maybe one isn't available, or maybe the specialized tool you are using is insisting on a regular expression and nothing else.
Because you are using a quantified subpattern, and as described in the Perl documentation:
By default, a quantified subpattern is "greedy", that is, it will
match as many times as possible (given a particular starting location)
while still allowing the rest of the pattern to match. If you want it
to match the minimum number of times possible, follow the quantifier
with a "?" . Note that the meanings don't change, just the
"greediness":
*? //Match 0 or more times, not greedily (minimum matches)
+? //Match 1 or more times, not greedily
Thus, to make your quantified pattern match as little as possible, follow it with a ?:
/location="(.*?)"/
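import regex  # third-party module; the standard-library re behaves the same here
text = 'ask her to call Mary back when she comes back'
# (?i) ignores case, (?s) lets . match newlines, .*? stops at the first "back"
p = r'(?i)(?s)call(.*?)back'
for match in regex.finditer(p, text):
    print(match.group(1))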
Output:
Mary

Regex to find words from list, when specific words not appear 3 words before

I want to find all matches of specific words from a list, but only when certain other words do not appear within the 3 words before them.
For example:
Find every occurrence of the words "good|best|better" in the text, but only where none of the words "no|not|none" appears in the 3 words before it.
I tried something like that:
(?<!\sno|\snot(\s|\s\w\s|\s\w\s\w\s))(\bgood\b|\bbest\b|\bbetter\b)
But it's not working.
You may be able to use this PCRE regex in R with perl=TRUE option:
\b(?:not?|none)(?:\s+\S+){0,2}\s+(good|best|better)\b(*SKIP)(*F)|\b(?:good|best|better)\b
RegEx Demo
In your R code use:
gregexpr("\\b(?:not?|none)(?:\\s+\\S+){0,2}\\s+(good|best|better)\\b(*SKIP)(*F)|\\b(?:good|best|better)\\b", mystr, perl=TRUE)
In PCRE, verbs (*SKIP)(*F) are used to fail and skip a match that we don't want to match.
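Python's standard re module has no (*SKIP)(*F) verbs, so the sketch below swaps in a different technique: the unwanted "negation followed by target" branch is matched first and discarded, and only matches from the second branch, which fills capture group 1, are kept. The function name find_unnegated and the sample sentences are assumptions for illustration.

```python
import re

PATTERN = re.compile(
    # Branch 1: a negation, up to two words in between, then a target word.
    # It is matched so that it gets consumed, and then thrown away.
    r"\b(?:not?|none)(?:\s+\S+){0,2}\s+(?:good|best|better)\b"
    # Branch 2: a target word with no negation in reach; captured in group 1.
    r"|\b(good|best|better)\b"
)


def find_unnegated(text):
    """Return target words not preceded by no/not/none within 3 words."""
    return [m.group(1) for m in PATTERN.finditer(text) if m.group(1)]


print(find_unnegated("this is good but not really good and the best"))
```

Because branch 1 consumes the negated occurrence, the scanner resumes after it, which plays roughly the same role as (*SKIP)(*F) in the PCRE pattern.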
If we only wanted to fail on no and its derivatives, we would start with a simple expression such as:
^(?!.*no).*times.*$
Then, we would add word boundary if necessary, and we would expand that to:
^(?!.*\bno\b|.*\bnot\b|.*\bnone\b).*times.*$
Demo 1
and finally we would add our desired words using:
^(?!.*\bno\b|.*\bnot\b|.*\bnone\b)(?=.*\bgood\b|.*\bbest\b|.*\bbetter\b).*times.*$
Demo 2

The R-function history() doesn't accept my pattern regexp

I experimented with a pattern to reject any line beginning with 0 to N spaces followed by a "#" . At this test webpage, the following pattern works fine:
^(?!(\s*[#])).*
I used as test lines of text the following:
#tbadword
#test
one two
abadwo#rds
#three
And only the "non-comment" lines are selected.
But in R, using the Windows Rgui , if I try
> history(Inf, pattern = '^(?!(\\s*[#])).*' )
I get the error message "Invalid regexp" .
Can someone point out what R is unhappy with here? Do I need to set a global "perl=TRUE" or some such thing? Or is there a simpler way to do this?
The history() command has a ... for values that will be passed to grep(), so you can use the invert= flag rather than a look-ahead to find what you need. How about
history(Inf, pattern="^\\s*#", invert=TRUE)
You may have R history parse your regex with a PCRE regex engine:
history(Inf, pattern="^(?!\\s*#)", perl=TRUE)
Now, ^(?!\s*#) will be parsed correctly as
^ - start of string
(?!\s*#) - a negative lookahead that fails the match if, immediately to the right of the current location (i.e. at the start of the string) there are 0+ whitespaces and then #.
Although the solution with invert=TRUE and an opposite regex is more natural for the current scenario, you may need the more advanced regex functionality for other cases, and perl=TRUE will help cover them.
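The two approaches (negative lookahead versus an inverted match on the opposite pattern) select the same lines. Here is a small Python sketch of the comparison, not R's history(); the test lines come from the question, and the leading spaces on two of them are an assumption.

```python
import re

# Test lines from the question; leading spaces are assumed for illustration.
lines = ["#tbadword", "  #test", "one two", "abadwo#rds", " #three"]

# Approach 1: negative lookahead, keep lines where the pattern matches.
not_comment = re.compile(r"^(?!\s*#)")
kept_lookahead = [ln for ln in lines if not_comment.search(ln)]

# Approach 2: match comment lines and invert the result (like invert=TRUE).
comment = re.compile(r"^\s*#")
kept_inverted = [ln for ln in lines if not comment.search(ln)]

print(kept_lookahead)
print(kept_inverted)
```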

gsub seems to be replacing everything between the first and last character of the pattern, rather than repeatedly replacing the pattern [duplicate]

I have this gigantic ugly string:
J0000000: Transaction A0001401 started on 8/22/2008 9:49:29 AM
J0000010: Project name: E:\foo.pf
J0000011: Job name: MBiek Direct Mail Test
J0000020: Document 1 - Completed successfully
I'm trying to extract pieces from it using regex. In this case, I want to grab everything after Project Name up to the part where it says J0000011: (the 11 is going to be a different number every time).
Here's the regex I've been playing with:
Project name:\s+(.*)\s+J[0-9]{7}:
The problem is that it doesn't stop until it hits the J0000020: at the end.
How do I make the regex stop at the first occurrence of J[0-9]{7}?
Make .* non-greedy by adding '?' after it:
Project name:\s+(.*?)\s+J[0-9]{7}:
Using non-greedy quantifiers here is probably the best solution, also because it is more efficient than the greedy alternative: Greedy matches generally go as far as they can (here, until the end of the text!) and then trace back character after character to try and match the part coming afterwards.
However, consider using a negated character class instead:
Project name:\s+(\S*)\s+J[0-9]{7}:
\S means "everything except whitespace", and this is exactly what you want.
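The three patterns can be compared on the sample log. In this Python sketch the log is assumed to be a single string with the entries separated by spaces, since in Python . does not cross newlines by default and the greedy overshoot would not show up otherwise.

```python
import re

# The log from the question, assumed here to be one space-separated string.
log = (
    r"J0000000: Transaction A0001401 started on 8/22/2008 9:49:29 AM "
    r"J0000010: Project name: E:\foo.pf "
    r"J0000011: Job name: MBiek Direct Mail Test "
    r"J0000020: Document 1 - Completed successfully"
)

greedy = re.search(r"Project name:\s+(.*)\s+J[0-9]{7}:", log).group(1)
lazy = re.search(r"Project name:\s+(.*?)\s+J[0-9]{7}:", log).group(1)
non_space = re.search(r"Project name:\s+(\S*)\s+J[0-9]{7}:", log).group(1)

print(greedy)     # overshoots to the last J-number
print(lazy)       # stops at the first J-number
print(non_space)  # same, but only works while the value has no spaces
```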
Well, ".*" is a greedy selector. You make it non-greedy by using ".*?". With the latter construct, each time the regex engine is about to consume another character with the ".", it first attempts to match whatever may come after the ".*?". This means that if, for instance, nothing comes after the ".*?", then it matches nothing.
Here's what I used. s contains your original string. This code is .NET specific, but most flavors of regex will have something similar.
string m = Regex.Match(s, @"Project name: (?<name>.*?) J\d+").Groups["name"].Value;
I would also recommend you experiment with regular expressions using "Expresso" - it's a great (and free) utility for regex editing and testing.
One of its upsides is that its UI exposes a lot of regex functionality that people inexperienced with regex might not be familiar with, in a way that makes these new concepts easy to learn.
For example, when building your regex using the UI, and choosing "*", you have the ability to check the checkbox "As few as possible" and see the resulting regex, as well as test its behavior, even if you were unfamiliar with non-greedy expressions before.
Available for download at their site:
http://www.ultrapico.com/Expresso.htm
Express download:
http://www.ultrapico.com/ExpressoDownload.htm
(Project name:\s+[A-Z]:(?:\\\w+)+\.[a-zA-Z]+\s+J[0-9]{7})(?=:)
This will work for you.
Using (?:\\\w+)+\.[a-zA-Z]+ instead of .* makes the pattern more restrictive.

Creating RegEx That Reads Entire String

My current regex is only picking up part of my string. It creates a match as soon as one is found, even though I need the longer version of that match to hit. For example, I am creating matches for both:
SSS111
and
SSS111-L
The first SSS111 matches fine with my current regex, but the SSS111-L is only getting matched to the SSS111, leaving the -L out.
How can I create a greedy regex to read the whole line before matching? I am currently using
[-A-Z0-9]{3,12}
to capture the numbers and letters, but have not had any luck outside of this.
Regex quantifiers are greedy by default, so greediness is usually not the problem.
Here I think you only have to escape the '-':
@"[\-A-Z0-9]{3,12}"
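As a sanity check, both the question's pattern and an escaped-hyphen variant can be run against the two sample codes. This is a sketch in Python's re module, which may behave differently from the asker's engine; the space-separated input string is an assumption.

```python
import re

codes = "SSS111 SSS111-L"  # the two sample codes, joined for illustration

unescaped = re.findall(r"[-A-Z0-9]{3,12}", codes)   # hyphen first in the class
escaped = re.findall(r"[\-A-Z0-9]{3,12}", codes)    # hyphen escaped

print(unescaped)
print(escaped)
```

In Python both forms pick up the full SSS111-L, which suggests that if the -L is being dropped, the cause may lie in how the matches are used rather than in the character class itself.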

Resources