match all parentheses between two curly brackets - r

I'm trying to find a RegEx pattern that lets me match on all parentheses (and their content) as long as these parentheses are between { and }.
Examples:
{foo (i,j) bar} should match on (i,j)
{(i,j) foo (k,l) bar (m,n,o)} should match on (i,j), (k,l), and (m,n,o).
foo (i,j) bar should not match on anything because the string is not between swirly brackets.
{foo (i,j) bar} (k,l) should match on (i,j) but not (k,l) because the latter is outside of the swirly brackets.
The closest I came was with this pattern: (?<=\{)[^\(].*\(.*?\).*(?=\}). This pattern matched on the first, second, and fourth example, but matched on all of the content between the swirly brackets instead of only the parentheses and their content.

You can use
(?:\G(?!\A)|{)[^{}]*?\K\([^()]*\)
See the regex demo. If you want to make absolutely sure there is a closing } on the right, add a (?=[^{}]*}) positive lookahead at the end:
(?:\G(?!\A)|{)[^{}]*?\K\([^()]*\)(?=[^{}]*})
Details
(?:\G(?!\A)|{) - either end of the previous successful match or a { char
[^{}]*? - zero or more chars other than { and }, as few as possible
\K - match reset operator that discards all text matched so far from the current overall match memory buffer
\( - a ( char
[^()]* - zero or more chars other than ( and ) as many as possible
\) - a ) char
(?=[^{}]*}) - immediately on the right, there must be zero or more chars other than { and } and then a }.
See an R demo online:
x <- "{(i,j) foo (k,l) bar (m,n,o)} should match on (h,j), (a,s), and (i,o,g)."
regmatches(x, gregexpr("(?:\\G(?!\\A)|{)[^{}]*?\\K\\([^()]*\\)(?=[^{}]*})", x, perl=TRUE))
# [[1]]
# [1] "(i,j)" "(k,l)" "(m,n,o)"
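For comparison, here is a minimal Python sketch of the same extraction. Python's built-in re module supports neither \G nor \K, so this is not a port of the pattern above: it swaps in a two-step approach that first isolates each {...} block and then collects the parenthesised groups inside it. The function name parens_inside_braces is hypothetical, and nested braces or nested parentheses are assumed not to occur.

```python
import re

# A {...} block without nested braces; group 1 is the content between them.
BRACES = re.compile(r"\{([^{}]*)\}")
# A (...) group without nested parentheses.
PARENS = re.compile(r"\([^()]*\)")


def parens_inside_braces(text):
    """Return every (...) group that sits inside a {...} block (hypothetical helper)."""
    found = []
    for block in BRACES.finditer(text):
        # Only search the text between the braces, so groups outside are ignored.
        found.extend(PARENS.findall(block.group(1)))
    return found


if __name__ == "__main__":
    for sample in ["{foo (i,j) bar}", "foo (i,j) bar", "{foo (i,j) bar} (k,l)"]:
        print(sample, "->", parens_inside_braces(sample))
```

Because a block is only recognised when its closing } is present, this variant behaves like the second pattern (the one with the extra lookahead).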

Related

negative-lookahead in gsub

In a recent scenario I wanted to extract the very last part of each element in a vector of URLs.
E.g.
> urls <- c('https::abc/efg/hij/', 'https::abc/efg/hij/lmn/', 'https::abc/efg/hij/lmn/opr/')
> rs <- regexpr("([^/])*(?=/$)", urls, perl = TRUE)
> substr(urls, rs, rs + attr(rs, 'match.length'))
[1] "hij/" "lmn/" "opr/"
which is somewhat simple to read. But I'd like to understand how I could do something similar by inverting the lookahead expression, e.g. remove the second-to-last '/' and anything preceding it (assuming that the string always ends with '/'). I can't seem to get the exact logic straight:
> gsub('([^/]|[/])(?!([^/]*/)$)', '', urls, perl = TRUE)
[1] "/hij" "/lmn" "/opr"
Basically I'm looking for the regexp logic that would return the result in the first example, but using only a single gsub call.
To get a match only, you could still use the lookahead construct:
^.*/(?=[^/]*/$)
^ Start of the string
.*/ Match as many chars as possible up to a / (the engine backtracks to the last / for which the lookahead succeeds)
(?= Positive lookahead, assert what is on the right is
[^/]*/$ assert what is at the right is 0+ times any char except /, then match / at end of string
) Close lookahead
For example
gsub('^.*/(?=[^/]*/$)', '', urls, perl = TRUE)
An option using a negative lookahead:
^.*/(?!$)
^ Start of string
.*/ Match up to the last / that is not at the end of the string (the engine backtracks until the lookahead succeeds)
(?!$) Negative lookahead, assert what is directly to the right is not the end of the string
The non-regex & very quick solution would be to use basename():
basename(urls)
[1] "hij" "lmn" "opr"
Or, for your case:
paste0(basename(urls), '/')
[1] "hij/" "lmn/" "opr/"
My preferred method is to replace the whole string with part of the string, like so:
gsub("^.*/([^/]+/)$", "\\1", urls)
The "\\1" refers to whatever was matched inside ().
So basically I am replacing the whole string with the last part of the URL.
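The three regex approaches in this thread (positive lookahead, negative lookahead, and capture group) carry over to Python's re.sub unchanged, since they rely only on lookaheads and backreferences. The sketch below is illustrative; the function names are hypothetical and the URLs are the ones from the question.

```python
import re

urls = [
    "https::abc/efg/hij/",
    "https::abc/efg/hij/lmn/",
    "https::abc/efg/hij/lmn/opr/",
]


def last_segment_lookahead(url):
    # Delete everything up to the "/" that is followed by "<segment>/" at the end.
    return re.sub(r"^.*/(?=[^/]*/$)", "", url)


def last_segment_negative(url):
    # Delete everything up to the last "/" that is not the final character.
    return re.sub(r"^.*/(?!$)", "", url)


def last_segment_group(url):
    # Replace the whole string with the captured final "<segment>/".
    return re.sub(r"^.*/([^/]+/)$", r"\1", url)


if __name__ == "__main__":
    for url in urls:
        print(last_segment_lookahead(url), last_segment_negative(url), last_segment_group(url))
```

All three rely on the greedy .* backtracking from the end of the string, so they assume, as the question does, that every URL ends with a slash.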

Matching series of Ampersands in R?

I am unable to solve the question below and would appreciate any help.
I have a series of ampersands (&) in my data. I want to replace each pair of ampersands with some value, but for some reason I am unable to do it.
My attempt and example:
string1 <- "This aa should be replaced: but this aaa shouldn't"
string2 <- "This && should be replaced: but this &&& shouldn't"
gsub("aa", "XXX", string1) #1.
gsub("\\baa\\b", "XXX", string1) #2.
gsub("&&", "XXX", string2) #3.
gsub("\\b&&\\b", "XXX", string2) #4.
Above, if I want to match 'aa' in string1, I can take two approaches.
In approach 1 (denoted #1), I simply pass 'aa', but this also partially matches 'aaa', which I don't want: I want my regex to match exactly a pair of 'a', i.e. 'aa'.
To solve this I use regex #2, which works fine.
Now, in string2, I expected similar behavior: instead of matching a pair of 'a' I want to match a pair of '&', but this does not match.
Attempt #3 works, but it is not the result I want because it also partially matches '&&&'.
Attempt #4 does not work for some reason: it's not replacing the string.
My question is:
1) Why do pairs of ampersands not work with boundary conditions?
2) What is the way to solve this problem?
I had a really hard time with this and spent my entire day on it. I tried finding a solution on Google without success.
If someone knows of an existing post, please redirect me to it, or if this is a duplicate please let me know and I will remove it.
Thanks for your help and reading the question.
EDIT: My word boundary is space for now.
Outputs:
> gsub("aa", "XXX", string1)
[1] "This XXX should be replaced: but this XXXa shouldn't"
> gsub("\\baa\\b", "XXX", string1)
[1] "This XXX should be replaced: but this aaa shouldn't"
>
> gsub("&&", "XXX", string2)
[1] "This XXX should be replaced: but this XXX& shouldn't"
> gsub("\\b&&\\b", "XXX", string2)
[1] "This && should be replaced: but this &&& shouldn't"
>
Note: I have also checked with perl=TRUE, but it's not working.
The \b word boundary means:
There are three different positions that qualify as word boundaries:
Before the first character in the string, if the first character is a word character.
After the last character in the string, if the last character is a word character.
Between two characters in the string, where one is a word character and the other is not a word character.
The "\\b&&\\b" pattern matches && only when it is enclosed with word chars (letters, digits or _) on both sides, because & itself is not a word char.
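The word-boundary rules quoted above behave the same way in Python's re module, so a short sketch can make the failure visible. The helper name has_bounded_pair is hypothetical.

```python
import re

# \b only matches between a word char and a non-word char (or the string edge
# next to a word char). "&" is a non-word char.
BOUNDED_PAIR = re.compile(r"\b&&\b")


def has_bounded_pair(text):
    """True when "&&" is found with a word boundary on each side (hypothetical helper)."""
    return BOUNDED_PAIR.search(text) is not None


if __name__ == "__main__":
    # Glued to letters: boundary between "a" and "&", and between "&" and "b".
    print(has_bounded_pair("a&&b"))
    # Surrounded by spaces: space and "&" are both non-word chars, so no boundary.
    print(has_bounded_pair("a && b"))
```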
To match whitespace boundaries, you may use
gsub("(?<!\\S)&&(?!\\S)", "XXX", string2, perl=TRUE)
The pattern matches
(?<!\\S) - a location not immediately preceded with a non-whitespace char (that is, there must be start of string or a whitespace char immediately to the left of the current location)
&& - a literal substring
(?!\\S) - a location not immediately followed with a non-whitespace char (that is, there must be end of string or a whitespace char immediately to the right of the current location).
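The same whitespace-boundary pattern can be tried in Python, whose re module supports these fixed-width lookarounds. This is a sketch only; replace_pairs is a hypothetical name, and the replacement is passed through a function so it is always inserted literally.

```python
import re

# "&&" that has whitespace or a string edge on both sides.
ISOLATED_PAIR = re.compile(r"(?<!\S)&&(?!\S)")


def replace_pairs(text, replacement="XXX"):
    """Replace stand-alone "&&" tokens, leaving longer runs such as "&&&" alone."""
    return ISOLATED_PAIR.sub(lambda match: replacement, text)


if __name__ == "__main__":
    string2 = "This && should be replaced: but this &&& shouldn't"
    print(replace_pairs(string2))
```

In "&&&" the first two ampersands fail the lookahead (a third & follows) and the last two fail the lookbehind (an & precedes), so the run is left untouched.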
More specific, but you could use a 2-step function like so
replace2steps <- function(mystring, toreplace, replacement, toexclude, intermediate) {
  # Temporarily hide the excluded pattern behind a placeholder
  intermstring <- gsub(toexclude, intermediate, mystring)
  # Replace the target pattern in what remains
  result <- gsub(toreplace, replacement, intermstring)
  # Restore the excluded pattern
  result <- gsub(intermediate, toexclude, result)
  return(result)
}
replace2steps(string2, "&&", "XX", "&&&", "%%%")
[1] "This XX should be replaced: but this &&& shouldn't"

How to replace special characters using regex

Using ASP.NET for regex.
I've written an extension method that I want to use to replace whole words - a word might also be a single special character like '&'.
In this case I want to replace '&' with 'and', and I'll need to use the same technique to reverse it back from 'and' to '&', so it must work for whole words only and not extended words like 'hand'.
I've tried a few variations for the regex pattern - started with '\bWORD\b' which didn't work at all for the ampersand, and now have '\sWORD\s' which almost works except that it also removes the spaces around the word, meaning that a phrase like "health & beauty" ends up as "healthandbeauty".
Any help appreciated.
Here's the extension method:
public static string ReplaceWord(this string @this,
    string wordToFind,
    string replacement,
    RegexOptions regexOptions = RegexOptions.None)
{
    Guard.String.NotEmpty(() => @this);
    Guard.String.NotEmpty(() => wordToFind);
    Guard.String.NotEmpty(() => replacement);
    var pattern = string.Format(@"\s{0}\s", wordToFind);
    return Regex.Replace(@this, pattern, replacement, regexOptions);
}
In order to match a dynamic string that should be enclosed with spaces (or be located at the start or end of string), you can use negative lookarounds:
var pattern = string.Format(@"(?<!\S){0}(?!\S)", wordToFind);
or even safer, escaping the word so that special characters are treated literally:
var pattern = string.Format(@"(?<!\S){0}(?!\S)", Regex.Escape(wordToFind));
The (?<!\S) lookbehind will fail the match if the word is preceded with a non-whitespace character and the (?!\S) lookahead will fail the match if the word is followed with a non-whitespace character. Since lookarounds do not consume text, the surrounding spaces are kept.
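As a rough illustration of the same idea outside .NET, here is a Python sketch of the extension method. The name replace_word and its signature are assumptions made for this example; re.escape plays the role of Regex.Escape.

```python
import re


def replace_word(text, word_to_find, replacement, flags=0):
    """Replace whole, whitespace-delimited occurrences of word_to_find (hypothetical helper)."""
    # Escape the word so characters such as "+" or "&" are matched literally.
    pattern = r"(?<!\S)" + re.escape(word_to_find) + r"(?!\S)"
    # A function replacement keeps backslashes in `replacement` from being interpreted.
    return re.sub(pattern, lambda match: replacement, text, flags=flags)


if __name__ == "__main__":
    print(replace_word("health & beauty", "&", "and"))
    print(replace_word("health and beauty", "and", "&"))
    print(replace_word("hand in hand", "and", "&"))
```

The lookarounds assert the whitespace without consuming it, which is why the spaces around the word survive, unlike with the \s{0}\s pattern from the question.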

How to convert BNF to EBNF

How can I convert this BNF to EBNF?
<vardec> ::= var <vardeclist>;
<vardeclist> ::= <varandtype> {;<varandtype>}
<varandtype> ::= <ident> {,<ident>} : <typespec>
<ident> ::= <letter> {<idchar>}
<idchar> ::= <letter> | <digit> | _
EBNF or Extended Backus-Naur Form is ISO 14977:1996, and is available in PDF from ISO for free*. It is not widely used by the computer language standards. There's also a paper that describes it, and that paper contains this table summarizing EBNF notation.
Table 1: Extended BNF
Extended BNF      Operator   Meaning
-------------------------------------------------------------
unquoted words               Non-terminal symbol
" ... "                      Terminal symbol
' ... '                      Terminal symbol
( ... )                      Brackets
[ ... ]                      Optional symbols
{ ... }                      Symbols repeated zero or more times
{ ... }-                     Symbols repeated one or more times†
=                 in         Defining symbol
;                 post       Rule terminator
|                 in         Alternative
,                 in         Concatenation
-                 in         Except
*                 in         Occurrences of
(* ... *)                    Comment
? ... ?                      Special sequence
The * operator is used with a preceding (unsigned) integer number; it does not seem to allow for variable numbers of repetitions — such as 1-15 characters after an initial character to make identifiers up to 16 characters long. This lis
In the standard, open parenthesis ( is called start group symbol and close parenthesis ) is called end group symbol; open square bracket [ is start option symbol and close square bracket ] is end option symbol; open brace { is start repeat symbol and close brace } is end repeat symbol. Single quotes ' are called first quote symbol and double quotes " are second quote symbol.
* Yes, free — even though you can also pay 74 CHF for it if you wish. Look at the Note under the box containing the chargeable items.
The question seeks to convert this 'BNF' into EBNF:
<vardec> ::= var <vardeclist>;
<vardeclist> ::= <varandtype> {;<varandtype>}
<varandtype> ::= <ident> {,<ident>} : <typespec>
<ident> ::= <letter> {<idchar>}
<idchar> ::= <letter> | <digit> | _
The BNF is not formally defined, so we have to make some (easy) guesses as to what it means. The translation is routine (it could be mechanical if the BNF is formally defined):
vardec = 'var', vardeclist, ';';
vardeclist = varandtype, { ';', varandtype };
varandtype = ident, { ',', ident }, ':', typespec;
ident = letter, { idchar };
idchar = letter | digit | '_';
The angle brackets have to be removed around non-terminals; the definition symbol ::= is replaced by =; the terminals such as ; and _ are enclosed in quotes; concatenation is explicitly marked with ,; and each rule is ended with ;. The grouping and alternative operations in the original happen to coincide with the standard notation. Note that explicit concatenation with the comma means that multi-word non-terminals are unambiguous.
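To show that the translation really is mechanical for this particular grammar, here is a small Python sketch that applies the steps just listed. It is an illustration under stated assumptions, not a general BNF parser: the function name bnf_rule_to_ebnf is hypothetical, non-terminals are assumed to be written as <name>, the only metasymbols handled are { } and |, every other run of non-space characters is treated as one terminal, and terminals are assumed not to contain a single quote.

```python
import re

# One token per match: <nonterminal>, a metasymbol ({ } |), or a terminal run.
TOKEN = re.compile(r"<(\w+)>|([{}|])|([^\s<>{}|]+)")


def bnf_rule_to_ebnf(rule):
    """Translate one informal BNF rule to ISO-style EBNF (hypothetical helper)."""
    lhs, rhs = rule.split("::=", 1)
    name = lhs.strip().strip("<>")          # drop the angle brackets
    out = []
    prev_ends_operand = False               # True after a symbol or a closing }
    for match in TOKEN.finditer(rhs):
        nonterminal, meta, terminal = match.groups()
        if meta in ("|", "}"):
            out.append(" " + meta)
            prev_ends_operand = meta == "}"
            continue
        if meta == "{":
            piece = "{"
        elif nonterminal:
            piece = nonterminal
        else:
            piece = "'" + terminal + "'"    # quote terminals
        # Explicit concatenation: a comma between two adjacent operands.
        out.append((", " if prev_ends_operand else " ") + piece)
        prev_ends_operand = meta != "{"
    return name + " =" + "".join(out) + ";"  # = replaces ::=, ; ends the rule


if __name__ == "__main__":
    for line in [
        "<vardec> ::= var <vardeclist>;",
        "<vardeclist> ::= <varandtype> {;<varandtype>}",
        "<idchar> ::= <letter> | <digit> | _",
    ]:
        print(bnf_rule_to_ebnf(line))
```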
† Casual study of the standard itself suggests that the {...}- notation is not part of the standard, just of the paper. However, as jmmut notes in a comment, the standard does define the meaning of {…}-:
§5.8 Syntactic term
…
When a syntactic-term is a syntactic-factor followed by an except-symbol followed by a syntactic-exception it represents any sequence of symbols that satisfies both of the conditions:
a) it is a sequence of symbols represented by the syntactic-factor,
b) it is not a sequence of symbols represented by the syntactic-exception.
…
NOTE - { "A" } - represents a sequence of one or more A's because it is a syntactic-term with an empty syntactic-exception.
Remove the angle brackets and put all terminals into quotes:
vardec ::= "var" vardeclist ";"
vardeclist ::= varandtype { ";" varandtype }
varandtype ::= ident { "," ident } ":" typespec
ident ::= letter { idchar }
idchar ::= letter | digit | "_"

Regular Expression required for condition

I need a regular expression which can match a string with the following requirements:
Must be between 6 and 64 characters long
Cannot include the following symbols : #, &, ', <, >, !, ", /, #, $, %, +, ?, (, ), *, [ , ] , \ , { , }
Cannot contain spaces, tabs, or consecutive underscores, i.e. __
Cannot contain elements that imply an email address or URL, such as ".com", ".net", ".org", ".edu" or any variation (e.g. "_com" or "-com")
Cannot start with underscore '_', dash '-' or period '.'
Cannot contain the words "honey" or "allied"
Cannot contain single letter followed by numbers
This is better done with several regular expressions! And some of your conditions don't even need regexes (in fact, they would be counterproductive). Taking the conditions in order:
use a string length function;
use a function that looks up each of those characters in your string;
match against _{2,} and \s;
match against [._-](?:com|net|....);
use a string function looking for these characters at the first position, or match against ^[-._];
whole words? What about "calliedaaa"? If whole words, match against \b(?:honey|allied)\b, otherwise use a string lookup function;
match against \w\d+.
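The one-check-per-condition approach might look like the following Python sketch. Several details are assumptions rather than part of the answer: the name is_acceptable is hypothetical, the forbidden characters are copied from the question as listed, the elided domain list is limited to the four suffixes the question names, the banned words are matched as case-insensitive substrings, and the last check narrows \w\d+ to an ASCII letter followed by a digit, which is only one possible reading of "single letter followed by numbers".

```python
import re

# Characters listed in the question (assumption: taken exactly as written there).
FORBIDDEN_CHARS = set("#&'<>!\"/$%+?()*[]\\{}")
# Only the suffixes named in the question; the answer elides the full list.
DOMAIN_LIKE = re.compile(r"[._-](?:com|net|org|edu)", re.IGNORECASE)
BANNED_WORDS = ("honey", "allied")


def is_acceptable(name):
    """Apply each rule as its own check (hypothetical helper)."""
    if not 6 <= len(name) <= 64:                      # plain length check
        return False
    if any(ch in FORBIDDEN_CHARS for ch in name):     # plain character lookup
        return False
    if re.search(r"\s|_{2,}", name):                  # whitespace or "__"
        return False
    if DOMAIN_LIKE.search(name):                      # ".com", "_net", "-org", ...
        return False
    if name.startswith(("_", "-", ".")):              # forbidden first character
        return False
    lowered = name.lower()
    if any(word in lowered for word in BANNED_WORDS): # substring lookup
        return False
    if re.search(r"[A-Za-z]\d", name):                # letter directly followed by a digit
        return False
    return True


if __name__ == "__main__":
    for candidate in ["good-name", "my__name", "john_company", "honeybee"]:
        print(candidate, is_acceptable(candidate))
```

Keeping the rules separate makes it easy to report which one failed and to adjust a single rule once its exact meaning is settled.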

Resources