How to properly use capture reference in R using grepl function [duplicate] - r

I'm a regular expression newbie and I can't quite figure out how to write a single regular expression that would "match" any duplicate consecutive words such as:
Paris in the the spring.
Not that that is related.
Why are you laughing? Are my my regular expressions THAT bad??
Is there a single regular expression that will match ALL of the bold strings above?

Try this regular expression:
\b(\w+)\s+\1\b
Here \b is a word boundary and \1 references the captured match of the first group.
Regex101 example here

I believe this regex handles more situations:
/(\b\S+\b)\s+\b\1\b/
A good selection of test strings can be found here: http://callumacrae.github.com/regex-tuesday/challenge1.html

The below expression should work correctly to find any number of duplicated words. The matching can be case insensitive.
String regex = "\\b(\\w+)(\\s+\\1\\b)+";
Pattern p = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
Matcher m = p.matcher(input);
// Check for subsequences of input that match the compiled pattern
while (m.find()) {
input = input.replaceAll(m.group(0), m.group(1));
}
Sample Input : Goodbye goodbye GooDbYe
Sample Output : Goodbye
Explanation:
The regex expression:
\b : Start of a word boundary
\w+ : Any number of word characters
(\s+\1\b)* : Any number of space followed by word which matches the previous word and ends the word boundary. Whole thing wrapped in * helps to find more than one repetitions.
Grouping :
m.group(0) : Shall contain the matched group in above case Goodbye goodbye GooDbYe
m.group(1) : Shall contain the first word of the matched pattern in above case Goodbye
Replace method shall replace all consecutive matched words with the first instance of the word.

Try this with below RE
\b start of word word boundary
\W+ any word character
\1 same word matched already
\b end of word
()* Repeating again
public static void main(String[] args) {
String regex = "\\b(\\w+)(\\b\\W+\\b\\1\\b)*";// "/* Write a RegEx matching repeated words here. */";
Pattern p = Pattern.compile(regex, Pattern.CASE_INSENSITIVE/* Insert the correct Pattern flag here.*/);
Scanner in = new Scanner(System.in);
int numSentences = Integer.parseInt(in.nextLine());
while (numSentences-- > 0) {
String input = in.nextLine();
Matcher m = p.matcher(input);
// Check for subsequences of input that match the compiled pattern
while (m.find()) {
input = input.replaceAll(m.group(0),m.group(1));
}
// Prints the modified sentence.
System.out.println(input);
}
in.close();
}

Regex to Strip 2+ duplicate words (consecutive/non-consecutive words)
Try this regex that can catch 2 or more duplicate words and only leave behind one single word. And the duplicate words need not even be consecutive.
/\b(\w+)\b(?=.*?\b\1\b)/ig
Here, \b is used for Word Boundary, ?= is used for positive lookahead, and \1 is used for back-referencing.
Example
Source

The widely-used PCRE library can handle such situations (you won't achieve the the same with POSIX-compliant regex engines, though):
(\b\w+\b)\W+\1

Here is one that catches multiple words multiple times:
(\b\w+\b)(\s+\1)+

No. That is an irregular grammar. There may be engine-/language-specific regular expressions that you can use, but there is no universal regular expression that can do that.

This is the regex I use to remove duplicate phrases in my twitch bot:
(\S+\s*)\1{2,}
(\S+\s*) looks for any string of characters that isn't whitespace, followed whitespace.
\1{2,} then looks for more than 2 instances of that phrase in the string to match. If there are 3 phrases that are identical, it matches.

Since some developers are coming to this page in search of a solution which not only eliminates duplicate consecutive non-whitespace substrings, but triplicates and beyond, I'll show the adapted pattern.
Pattern: /(\b\S+)(?:\s+\1\b)+/ (Pattern Demo)
Replace: $1 (replaces the fullstring match with capture group #1)
This pattern greedily matches a "whole" non-whitespace substring, then requires one or more copies of the matched substring which may be delimited by one or more whitespace characters (space, tab, newline, etc).
Specifically:
\b (word boundary) characters are vital to ensure partial words are not matched.
The second parenthetical is a non-capturing group, because this variable width substring does not need to be captured -- only matched/absorbed.
the + (one or more quantifier) on the non-capturing group is more appropriate than * because * will "bother" the regex engine to capture and replace singleton occurrences -- this is wasteful pattern design.
*note if you are dealing with sentences or input strings with punctuation, then the pattern will need to be further refined.

The example in Javascript: The Good Parts can be adapted to do this:
var doubled_words = /([A-Za-z\u00C0-\u1FFF\u2800-\uFFFD]+)\s+\1(?:\s|$)/gi;
\b uses \w for word boundaries, where \w is equivalent to [0-9A-Z_a-z]. If you don't mind that limitation, the accepted answer is fine.

This expression (inspired from Mike, above) seems to catch all duplicates, triplicates, etc, including the ones at the end of the string, which most of the others don't:
/(^|\s+)(\S+)(($|\s+)\2)+/g, "$1$2")
I know the question asked to match duplicates only, but a triplicate is just 2 duplicates next to each other :)
First, I put (^|\s+) to make sure it starts with a full word, otherwise "child's steak" would go to "child'steak" (the "s"'s would match). Then, it matches all full words ((\b\S+\b)), followed by an end of string ($) or a number of spaces (\s+), the whole repeated more than once.
I tried it like this and it worked well:
var s = "here here here here is ahi-ahi ahi-ahi ahi-ahi joe's joe's joe's joe's joe's the result result result";
print( s.replace( /(\b\S+\b)(($|\s+)\1)+/g, "$1"))
--> here is ahi-ahi joe's the result

Try this regular expression it fits for all repeated words cases:
\b(\w+)\s+\1(?:\s+\1)*\b

I think another solution would be to use named capture groups and backreferences like this:
.* (?<mytoken>\w+)\s+\k<mytoken> .*/
OR
.*(?<mytoken>\w{3,}).+\k<mytoken>.*/
Kotlin:
val regex = Regex(""".* (?<myToken>\w+)\s+\k<myToken> .*""")
val input = "This is a test test data"
val result = regex.find(input)
println(result!!.groups["myToken"]!!.value)
Java:
var pattern = Pattern.compile(".* (?<myToken>\\w+)\\s+\\k<myToken> .*");
var matcher = pattern.matcher("This is a test test data");
var isFound = matcher.find();
var result = matcher.group("myToken");
System.out.println(result);
JavaScript:
const regex = /.* (?<myToken>\w+)\s+\k<myToken> .*/;
const input = "This is a test test data";
const result = regex.exec(input);
console.log(result.groups.myToken);
// OR
const regex = /.* (?<myToken>\w+)\s+\k<myToken> .*/g;
const input = "This is a test test data";
const result = [...input.matchAll(regex)];
console.log(result[0].groups.myToken);
All the above detect the test as the duplicate word.
Tested with Kotlin 1.7.0-Beta, Java 11, Chrome and Firefox 100.

You can use this pattern:
\b(\w+)(?:\W+\1\b)+
This pattern can be used to match all duplicated word groups in sentences. :)
Here is a sample util function written in java 17, which replaces all duplications with the first occurrence:
public String removeDuplicates(String input) {
var regex = "\\b(\\w+)(?:\\W+\\1\\b)+";
var pattern = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
var matcher = pattern.matcher(input);
while (matcher.find()) {
input = input.replaceAll(matcher.group(), matcher.group(1));
}
return input;
}

As far as I can see, none of these would match:
London in the
the winter (with the winter on a new line )
Although matching duplicates on the same line is fairly straightforward,
I haven't been able to come up with a solution for the situation in which they
stretch over two lines. ( with Perl )

To find duplicate words that have no leading or trailing non whitespace character(s) other than a word character(s), you can use whitespace boundaries on the left and on the right making use of lookarounds.
The pattern will have a match in:
Paris in the the spring.
Not that that is related.
The pattern will not have a match in:
This is $word word
(?<!\S)(\w+)\s+\1(?!\S)
Explanation
(?<!\S) Negative lookbehind, assert not a non whitespace char to the left of the current location
(\w+) Capture group 1, match 1 or more word characters
\s+ Match 1 or more whitespace characters (note that this can also match a newline)
\1 Backreference to match the same as in group 1
(?!\S) Negative lookahead, assert not a non whitespace char to the right of the current location
See a regex101 demo.
To find 2 or more duplicate words:
(?<!\S)(\w+)(?:\s+\1)+(?!\S)
This part of the pattern (?:\s+\1)+ uses a non capture group to repeat 1 or more times matching 1 or more whitespace characters followed by the backreference to match the same as in group 1.
See a regex101 demo.
Alternatives without using lookarounds
You could also make use of a leading and trailing alternation matching either a whitespace char or assert the start/end of the string.
Then use a capture group 1 for the value that you want to get, and use a second capture group with a backreference \2 to match the repeated word.
Matching 2 duplicate words:
(?:\s|^)((\w+)\s+\2)(?:\s|$)
See a regex101 demo.
Matching 2 or more duplicate words:
(?:\s|^)((\w+)(?:\s+\2)+)(?:\s|$)
See a regex101 demo.

Use this in case you want case-insensitive checking for duplicate words.
(?i)\\b(\\w+)\\s+\\1\\b

Related

Regex UTM google

I'm trying to extract a UTM from a Google link using r, but my regex doesn't seem to work properly.
Here an example of a google link :
xxx/yyy?utm_medium=display&utm_source=ogury&utm_campaign=TOTO&zzz=coco
I tried the following regex to extract TOTO:
.+&utm_campaign=([[a-z]]+)&.+
with no success.
If someone can help, thanks!
In your pattern, [[a-z]]+ is a malformed bracket expression, because it matches any char from the [[a-z] bracket expression (any lowercase ASCII letter or [) and then matches one or more ] chars. You meant to use single [ and ] here.
You may use sub with the following regex:
sub(".*[&?]utm_campaign=([^&]+).*", "\\1", s)
See the regex demo.
Details
.* - any 0+ chars, as many as possible
[&?] - a ? or &
utm_campaign= - a literal substring
([^&]+) - Capturing group 1: one or more chars other than & chars
.* - any 0+ chars, as many as possible
The \1 is the replacement backreference that puts the contents of Group 1 into the result.
See the R demo:
s <- "xxx/yyy?utm_medium=display&utm_source=ogury&utm_campaign=TOTO&zzz=coco"
sub(".*[&?]utm_campaign=([^&]+).*", "\\1", s)
## => [1] "TOTO"
You could use:
(?:&utm_campaign=)(\w+)
and use the first group captured
Try it Online
Here's a regex string that'll match the value of a utm_campaign parameter, regardless of its position in the query string.
(?<TOTO>(?<=utm_campaign=).*?(?=&|$))
Explanation:
?<TOTO> captures the result into a TOTO key after the regex is executed
(?<=utm_campaign=) is a look-behind that will ensure that the value is preceded by utm_campaign=
.*? will find the parameter value (i.e. TOTO). The reason for the ? is lazy evaluation - it will only search until the next rule is matched (see point below)
(?=&|$) is a look-ahead that will match either an & or the end of the string (in the case that utm_campaign is the last parameter)
You are searching for [[a-z]]+ however TOTO is uppercase, so not between 'a' and 'z'. You can update it to [[A-Za-z]]+ to match any case letter.
EDIT: [[A-Za-z]]+ will match any case letter, but will also match any '[' or ']' characters. If you do not wish to match these then you can change it to [A-Za-z]+ to only match any case letters

Find hyphonated words with Regex in list R

I have a string of semicolon-separated elements and I want to find if a pattern matchs with any of the elements in the string:
string <- "CPT1B;CPT1B;CPT1B;CHKB-CPT1B;CPT1B;CPT1B;CPT1B;CPT1B"
I want to know which regex use to match any of these elements, I mean, I want to get TRUE if any of the elements match with, for example, "CPT1B", to do so I use:
grepl(paste("[^;]","CPT1B,"[$;]",sep = ""),string)
TRUE
I used "[^;]" and "[$;]" because I want to get TRUE if any of the elements match.
My problem came when I try to match with "CHKB-CPT1B", because if I use the same expression:
grepl(paste("[^;]","CHKB-CPT1B","[$;]",sep = ""),string)
FALSE
I get FALSE, I think it's due to the hyphen in the word, and I'd like to know how to make grepl read the word with the hyphen as one word.
I don't want to use "CHKB\-CPT1B", because this pattern would came from an iterator that could be both hyphenated and non-hyphenated words. And I would also like not to split the original string by ";"
You need to use alternation groups:
grepl(paste0("(?:^|;)", "CPT1B", "(?:$|;)"),string)
[1] TRUE
The (?:^|;) non-capturing group matches start of string or ; and (?:$|;) matches either the end of string or ;.
You also may use lookarounds with perl=TRUE (i.e. a PCRE pattern):
grepl(paste0("(?<![^;])", "CPT1B", "(?![^;])"),string, perl=TRUE)
Here, the negative lookbehind (?<![^;]) matches any location that is immediately preceded with ; or start of string, and the negative lookahead (?![^;]) requires the next char to be ; or end of string location.

Regex to maintain matched parts

I would like to achieve this result : "raster(B04) + raster(B02) - raster(A10mB03)"
Therefore, I created this regex: B[0-1][0-9]|A[1,2,6]0m/B[0-1][0-9]"
I am now trying to replace all matches of the string "B04 + B02 - A10mB03" with gsub("B[0-1][0-9]]|[A[1,2,6]0mB[0-1][0-9]", "raster()", string)
How could I include the original values B01, B02, A10mB03?
PS: I also tried gsub("B[0-1][0-9]]|[A[1,2,6]0mB[0-1][0-9]", "raster(\\1)", string) but it did not work.
Basically, you need to match some text and re-use it inside a replacement pattern. In base R regex methods, there is no way to do that without a capturing group, i.e. a pair of unescaped parentheses, enclosing the whole regex pattern in this case, and use a \\1 replacement backreference in the replacement pattern.
However, your regex contains some issues: [A[1,2,6] gets parsed as a single character class that matches A, [, 1, ,, 2 or 6 because you placed a [ before A. Also, note that , inside character classes matches a literal comma, and it is not what you expected. Another, similar issue, is with [0-9]] - it matches any ASCII digit with [0-9] and then a ] (the ] char does not have to be escaped in a regex pattern).
So, a potential fix for you expression can look like
gsub("(B[0-1][0-9]|A[126]0mB[0-1][0-9])", "raster(\\1)", string)
Or even just matching 1 or more word chars (considering the sample string you supplied)
gsub("(\\w+)", "raster(\\1)", string)
might do.
See the R demo online.

sub command to extract data and split data frame column [duplicate]

Simple regex question. I have a string on the following format:
this is a [sample] string with [some] special words. [another one]
What is the regular expression to extract the words within the square brackets, ie.
sample
some
another one
Note: In my use case, brackets cannot be nested.
You can use the following regex globally:
\[(.*?)\]
Explanation:
\[ : [ is a meta char and needs to be escaped if you want to match it literally.
(.*?) : match everything in a non-greedy way and capture it.
\] : ] is a meta char and needs to be escaped if you want to match it literally.
(?<=\[).+?(?=\])
Will capture content without brackets
(?<=\[) - positive lookbehind for [
.*? - non greedy match for the content
(?=\]) - positive lookahead for ]
EDIT: for nested brackets the below regex should work:
(\[(?:\[??[^\[]*?\]))
This should work out ok:
\[([^]]+)\]
Can brackets be nested?
If not: \[([^]]+)\] matches one item, including square brackets. Backreference \1 will contain the item to be match. If your regex flavor supports lookaround, use
(?<=\[)[^]]+(?=\])
This will only match the item inside brackets.
To match a substring between the first [ and last ], you may use
\[.*\] # Including open/close brackets
\[(.*)\] # Excluding open/close brackets (using a capturing group)
(?<=\[).*(?=\]) # Excluding open/close brackets (using lookarounds)
See a regex demo and a regex demo #2.
Use the following expressions to match strings between the closest square brackets:
Including the brackets:
\[[^][]*] - PCRE, Python re/regex, .NET, Golang, POSIX (grep, sed, bash)
\[[^\][]*] - ECMAScript (JavaScript, C++ std::regex, VBA RegExp)
\[[^\]\[]*] - Java, ICU regex
\[[^\]\[]*\] - Onigmo (Ruby, requires escaping of brackets everywhere)
Excluding the brackets:
(?<=\[)[^][]*(?=]) - PCRE, Python re/regex, .NET (C#, etc.), JGSoft Software
\[([^][]*)] - Bash, Golang - capture the contents between the square brackets with a pair of unescaped parentheses, also see below
\[([^\][]*)] - JavaScript, C++ std::regex, VBA RegExp
(?<=\[)[^\]\[]*(?=]) - Java regex, ICU (R stringr)
(?<=\[)[^\]\[]*(?=\]) - Onigmo (Ruby, requires escaping of brackets everywhere)
NOTE: * matches 0 or more characters, use + to match 1 or more to avoid empty string matches in the resulting list/array.
Whenever both lookaround support is available, the above solutions rely on them to exclude the leading/trailing open/close bracket. Otherwise, rely on capturing groups (links to most common solutions in some languages have been provided).
If you need to match nested parentheses, you may see the solutions in the Regular expression to match balanced parentheses thread and replace the round brackets with the square ones to get the necessary functionality. You should use capturing groups to access the contents with open/close bracket excluded:
\[((?:[^][]++|(?R))*)] - PHP PCRE
\[((?>[^][]+|(?<o>)\[|(?<-o>]))*)] - .NET demo
\[(?:[^\]\[]++|(\g<0>))*\] - Onigmo (Ruby) demo
If you do not want to include the brackets in the match, here's the regex: (?<=\[).*?(?=\])
Let's break it down
The . matches any character except for line terminators. The ?= is a positive lookahead. A positive lookahead finds a string when a certain string comes after it. The ?<= is a positive lookbehind. A positive lookbehind finds a string when a certain string precedes it. To quote this,
Look ahead positive (?=)
Find expression A where expression B follows:
A(?=B)
Look behind positive (?<=)
Find expression A where expression B
precedes:
(?<=B)A
The Alternative
If your regex engine does not support lookaheads and lookbehinds, then you can use the regex \[(.*?)\] to capture the innards of the brackets in a group and then you can manipulate the group as necessary.
How does this regex work?
The parentheses capture the characters in a group. The .*? gets all of the characters between the brackets (except for line terminators, unless you have the s flag enabled) in a way that is not greedy.
Just in case, you might have had unbalanced brackets, you can likely design some expression with recursion similar to,
\[(([^\]\[]+)|(?R))*+\]
which of course, it would relate to the language or RegEx engine that you might be using.
RegEx Demo 1
Other than that,
\[([^\]\[\r\n]*)\]
RegEx Demo 2
or,
(?<=\[)[^\]\[\r\n]*(?=\])
RegEx Demo 3
are good options to explore.
If you wish to simplify/modify/explore the expression, it's been explained on the top right panel of regex101.com. If you'd like, you can also watch in this link, how it would match against some sample inputs.
RegEx Circuit
jex.im visualizes regular expressions:
Test
const regex = /\[([^\]\[\r\n]*)\]/gm;
const str = `This is a [sample] string with [some] special words. [another one]
This is a [sample string with [some special words. [another one
This is a [sample[sample]] string with [[some][some]] special words. [[another one]]`;
let m;
while ((m = regex.exec(str)) !== null) {
// This is necessary to avoid infinite loops with zero-width matches
if (m.index === regex.lastIndex) {
regex.lastIndex++;
}
// The result can be accessed through the `m`-variable.
m.forEach((match, groupIndex) => {
console.log(`Found match, group ${groupIndex}: ${match}`);
});
}
Source
Regular expression to match balanced parentheses
(?<=\[).*?(?=\]) works good as per explanation given above. Here's a Python example:
import re
str = "Pagination.go('formPagination_bottom',2,'Page',true,'1',null,'2013')"
re.search('(?<=\[).*?(?=\])', str).group()
"'formPagination_bottom',2,'Page',true,'1',null,'2013'"
The #Tim Pietzcker's answer here
(?<=\[)[^]]+(?=\])
is almost the one I've been looking for. But there is one issue that some legacy browsers can fail on positive lookbehind.
So I had to made my day by myself :). I manged to write this:
/([^[]+(?=]))/g
Maybe it will help someone.
console.log("this is a [sample] string with [some] special words. [another one]".match(/([^[]+(?=]))/g));
if you want fillter only small alphabet letter between square bracket a-z
(\[[a-z]*\])
if you want small and caps letter a-zA-Z
(\[[a-zA-Z]*\])
if you want small caps and number letter a-zA-Z0-9
(\[[a-zA-Z0-9]*\])
if you want everything between square bracket
if you want text , number and symbols
(\[.*\])
This code will extract the content between square brackets and parentheses
(?:(?<=\().+?(?=\))|(?<=\[).+?(?=\]))
(?: non capturing group
(?<=\().+?(?=\)) positive lookbehind and lookahead to extract the text between parentheses
| or
(?<=\[).+?(?=\]) positive lookbehind and lookahead to extract the text between square brackets
In R, try:
x <- 'foo[bar]baz'
str_replace(x, ".*?\\[(.*?)\\].*", "\\1")
[1] "bar"
([[][a-z \s]+[]])
Above should work given the following explaination
characters within square brackets[] defines characte class which means pattern should match atleast one charcater mentioned within square brackets
\s specifies a space
 + means atleast one of the character mentioned previously to +.
I needed including newlines and including the brackets
\[[\s\S]+\]
If someone wants to match and select a string containing one or more dots inside square brackets like "[fu.bar]" use the following:
(?<=\[)(\w+\.\w+.*?)(?=\])
Regex Tester

Regex for anything between []

I need to find the regex for []
For eg, if the string is - Hi [Stack], Here is my [Tag] which i need to [Find].
It should return
Stack, Tag, Find
Pretty simple, you just need to (1) escape the brackets with backslashes, and (2) use (.*?) to capture the contents.
\[(.*?)\]
The parentheses are a capturing group, they capture their contents for later use. The question mark after .* makes the matching non-greedy. This means it will match the shortest match possible, rather than the longest one. The difference between greedy and non-greedy comes up when you have multiple matches in a line:
Hi [Stack], Here is my [Tag] which i need to [Find].
^______________________________________________^
A greedy match will find the longest string possible between two sets of square brackets. That's not right. A non-greedy match will find the shortest:
Hi [Stack], Here is my [Tag] which i need to [Find].
^_____^
Anyways, the code will end up looking like:
string regex = #"\[(.*?)\]";
string text = "Hi [Stack], Here is my [Tag] which i need to [Find].";
foreach (Match match in Regex.Matches(text, regex))
{
Console.WriteLine("Found {0}", match.Groups[1].Value);
}
\[([\w]+?)\]
should work. You might have to change the matching group if you need to include special chars as well.
Depending on what environment you mean:
\[([^\]]+)]
.NET syntax, taking care of multiple embedded brackets:
\[ ( (?: \\. | (?<OPEN> \[) | (?<-OPEN> \]) | [^\]] )*? (?(OPEN)(?!)) ) \]
This counts the number of opened [ sections in OPEN and only succeeds if OPEN is 0 in the end.
I encountered a similar issue and discovered that this also does the trick.
\[\w{1,}\]
The \w means Metacharacter. This will match 1 or more word characters.
Using n{X,} quantifier matches any string where you can obtain different amounts. With the second number left out on purpose, the expression means 1 or more characters to match.

Resources