I have been studying the Vue 3 source code recently, and one expression really confuses me.
Here it is:
const propertyDelimiterRE = /:(.+)/;
"color:red;".split(propertyDelimiterRE);// ["color", "red;", ""]
I don't understand why the result looks like that. Please help, thanks.
This is more of a split + regex question, but here goes.
The regular expression :(.+) has two parts, : and (.+)
: matches the : character literally
(.+) captures one or more characters (any character except line terminators)
So together, they match :red; (the full match), with red; as the capture group.
The second part is the way split behaves (https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/split#description):
When found, separator is removed from the string, and the substrings are returned in an array.
If separator is a regular expression with capturing parentheses, then each time separator matches, the results (including any undefined results) of the capturing parentheses are spliced into the output array.
So together...
"color:red;".split(/:(.+)/) uses : and everything after it as the separator.
Without the capture group, that would be (sort of) equivalent to "color:red;".split(":red;"),
which would return ["color", ""].
However, because the separator regex has a capture group, split splices the captured text (red;, without the colon) into the array, giving us ["color", "red;", ""].
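To make the difference concrete, here is a small sketch comparing the two calls. The variable names (plain, withGroup, key, value) are only illustrative, and the last step is my guess at how such a result might be consumed, not something taken from the Vue source.

```javascript
// Separator regex from the question: a literal ":" followed by a captured group.
const propertyDelimiterRE = /:(.+)/;

// With a plain string separator, the matched separator is simply dropped.
const plain = "color:red;".split(":red;");
console.log(plain); // ["color", ""]

// With a capture group, the captured text is spliced in between the pieces.
const withGroup = "color:red;".split(propertyDelimiterRE);
console.log(withGroup); // ["color", "red;", ""]

// Illustrative only: the first two entries behave like a key and a value.
const [key, value] = withGroup;
console.log(key, value); // color red;
```

Because .+ is greedy and only the first match matters here, the string is effectively split at the first colon only.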
I have a main string as below
"/tmp/xjtscpdownload/7eb17cc6-b3c9-4ebd-945b-c0e0656a33f0/output/9999.317528060546245771146821638997525068657/"
From the main string I need to extract a substring starting from the UUID part:
"/7eb17cc6-b3c9-4ebd-945b-c0e0656a33f0/output/9999.317528060546245771146821638997525068657/"
I tried
string.match("/tmp/xjtscpdownload/7eb17cc6-b3c9-4ebd-945b-c0e0656a33f0/output/9999.317528060546245771146821638997525068657/", "/[a-fA-F0-9]{8}-[a-fA-F0-9]{4}-[a-fA-F0-9]{4}-[a-fA-F0-9]{4}-[a-fA-F0-9]{12}/(.)/(.)/$")
But no luck.
If you want to obtain
"/7eb17cc6-b3c9-4ebd-945b-c0e0656a33f0/output/9999.317528060546245771146821638997525068657/"
from
"/tmp/xjtscpdownload/7eb17cc6-b3c9-4ebd-945b-c0e0656a33f0/output/9999.317528060546245771146821638997525068657/"
or, say, 7eb17cc6-b3c9-4ebd-945b-c0e0656a33f0, output and 9999.317528060546245771146821638997525068657 as separate values, which is what your pattern attempt suggests. If you don't need them separately, leave out the parentheses in the following solution.
You can use a pattern like this:
local text = "/tmp/xjtscpdownload/7eb17cc6-b3c9-4ebd-945b-c0e0656a33f0/output/9999.317528060546245771146821638997525068657/"
print(text:match("/([%x%-]+)/([^/]+)/([^/]+)"))
"/([^/]+)/" captures at least one non-slash character between two slashes.
On your attempt:
You cannot give counts like {4} in a string pattern.
You have to escape - with % as it is a magic character.
(.) would only capture a single character.
Please read the Lua manual to find out what you did wrong and how to use string patterns properly.
You can also try this code:
s="/tmp/xjtscpdownload/7eb17cc6-b3c9-4ebd-945b-c0e0656a33f0/output/9999.317528060546245771146821638997525068657/"
print(s:match("/.-/.-(/.+)$"))
It skips the first two "fields" by using a non-greedy match.
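For contrast only: the original attempt would be close to valid in a regex engine that supports counted quantifiers like {4}. The sketch below uses JavaScript (not Lua) to show that idea; the names path, uuidPathRE and pathMatch are made up for illustration.

```javascript
// The path from the question.
const path = "/tmp/xjtscpdownload/7eb17cc6-b3c9-4ebd-945b-c0e0656a33f0/output/9999.317528060546245771146821638997525068657/";

// A UUID segment, then two more non-empty segments, anchored at the end.
// Unlike Lua patterns, JavaScript regexes accept {8}, {4}, {12}, and "-" needs no escaping here.
const uuidPathRE = /\/([a-fA-F0-9]{8}(?:-[a-fA-F0-9]{4}){3}-[a-fA-F0-9]{12})\/([^/]+)\/([^/]+)\/$/;

const pathMatch = path.match(uuidPathRE);
console.log(pathMatch[0]); // the substring starting at the UUID part
console.log(pathMatch[1]); // 7eb17cc6-b3c9-4ebd-945b-c0e0656a33f0
console.log(pathMatch[2]); // output
console.log(pathMatch[3]); // 9999.317528060546245771146821638997525068657
```

Note that (.+) or ([^/]+) is needed for each segment; (.) alone would only capture one character, as pointed out above.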
I'm trying to understand a regular expression someone has written in the gsub() function.
I've never used regular expressions before seeing this code, and I have tried to work out how it gets the final result with some googling, but I have hit a wall, so to speak.
gsub('.*(.{2}$)', '\\1',"my big fluffy cat")
This code returns the last two characters of the given string. In the above example it returns "at". This is the expected result, but from my brief foray into regular expressions I don't understand why this code does what it does.
What I understand is that '.*' means look for any character 0 or more times. So it's going to look at the entire string, and this is what will be replaced.
The part in brackets looks for any two characters at the end of the string. It would make more sense to me if this part in brackets was in place of the '\\1'. To me it would then read: look at the entire string and replace it with the last two characters of that string.
All that does, though, is output the actual code as the replacement, e.g. ".{2}$".
Finally, I don't understand why '\\1' is in the replacement part of the function. To me this just says: replace the entire string with a single backslash and the number one. I say a single backslash because it's my understanding that the first backslash is just there to make the second backslash a non-special character.
There are two ways of using gsub. The most common way is probably:
gsub("-","TEST","This is a - ")
which would return
This is a TEST
This simply finds the matches of the regular expression and replaces each one with the replacement string.
The second way to use gsub is the one you described, using \\1, \\2, \\3, and so on.
These refer to the first, second, or third capture group in your regular expression.
A capture group is defined by anything inside parentheses, e.g. (capture_group_1)(capture_group_2)...
Explanation
Your analysis is correct.
What i understand is the '.*' means look for any character 0 or more times. So it's going to look at the entire string and this is what will be replaced.
The part in brackets looks for any two characters at the end of the string
The last two characters are placed in a capture group, and we simply replace the whole match (here, the whole string) with that capture group. The last two characters themselves are not replaced with anything.
If it helps, check out the result of this expression:
gsub('(.*)(.{2}$)', 'Group 1: \\1, Group 2: \\2',"my big fluffy cat")
Hopefully these examples help you understand it better:
Say we have a string foobarabcabcdef
.* matches whole string.
.*abc matches from the beginning through the last abc (greedy matching), so it matches foobarabcabc.
.*(...)$ matches the whole string as well, but the last 3 characters are grouped. Every match has a default group, group 0, holding the whole match; each () adds group 1, 2, 3, and so on. Consider .*(...)(...)(...)$, which gives:
group 0 : whole string
group 1 : "abc" the first "abc"
group 2 : "abc" the 2nd "abc"
group 3 : "def" the last 3 chars
So back to your example: \\1 is a reference to group 1. What it does is "replace the whole match with the text matched by group 1". That is, the text matched by .{2}$ is the replacement.
If you don't understand the backslashes, look up R's string escaping syntax: the first backslash escapes the second, so the regex engine receives \1.
The important part of that regular expression is the parentheses, which form a "capturing group".
The regular expression .*(.{2}$) says: match anything, and capture the last 2 characters of the line. The replacement \\1 refers to that group, so the whole match is replaced with the captured group, which is the last two characters in this case.
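If it helps to see the same mechanics outside R, here is a rough JavaScript sketch (not R). JavaScript writes the group reference as $1 instead of \\1, and the variable names are only illustrative.

```javascript
const text = "my big fluffy cat";

// The whole string is matched; only the last two characters land in group 1.
const lastTwo = text.replace(/.*(.{2}$)/, "$1");
console.log(lastTwo); // at

// Two groups, to show what each one holds.
const labelled = text.replace(/(.*)(.{2}$)/, "Group 1: $1, Group 2: $2");
console.log(labelled); // Group 1: my big fluffy c, Group 2: at

// Group 0 is the whole match; groups 1 to 3 are the three parenthesised parts.
const groups = "foobarabcabcdef".match(/.*(...)(...)(...)$/).slice(0, 4);
console.log(groups); // ["foobarabcabcdef", "abc", "abc", "def"]
```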
I would like to achieve this result : "raster(B04) + raster(B02) - raster(A10mB03)"
Therefore, I created this regex: B[0-1][0-9]|A[1,2,6]0m/B[0-1][0-9]"
I am now trying to replace all matches of the string "B04 + B02 - A10mB03" with gsub("B[0-1][0-9]]|[A[1,2,6]0mB[0-1][0-9]", "raster()", string)
How could I include the original values B01, B02, A10mB03?
PS: I also tried gsub("B[0-1][0-9]]|[A[1,2,6]0mB[0-1][0-9]", "raster(\\1)", string) but it did not work.
Basically, you need to match some text and re-use it inside a replacement pattern. In base R regex methods, there is no way to do that without a capturing group, i.e. a pair of unescaped parentheses, enclosing the whole regex pattern in this case, and use a \\1 replacement backreference in the replacement pattern.
However, your regex contains some issues: [A[1,2,6] gets parsed as a single character class that matches A, [, 1, ,, 2 or 6 because you placed a [ before A. Also, note that , inside character classes matches a literal comma, and it is not what you expected. Another, similar issue, is with [0-9]] - it matches any ASCII digit with [0-9] and then a ] (the ] char does not have to be escaped in a regex pattern).
So, a potential fix for your expression can look like
gsub("(B[0-1][0-9]|A[126]0mB[0-1][0-9])", "raster(\\1)", string)
Or even just matching 1 or more word chars (considering the sample string you supplied)
gsub("(\\w+)", "raster(\\1)", string)
might do.
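As a cross-check of the character class issue and the fix, here is the same idea sketched in JavaScript rather than R (the backreference is written $1 there). The names expr, bandRE and buggyClass are just for illustration.

```javascript
const expr = "B04 + B02 - A10mB03";

// Whole pattern wrapped in one capturing group, reused in the replacement.
const bandRE = /(B[01][0-9]|A[126]0mB[01][0-9])/g;
const wrapped = expr.replace(bandRE, "raster($1)");
console.log(wrapped); // raster(B04) + raster(B02) - raster(A10mB03)

// The original "[A[1,2,6]" is a single character class: it matches ONE of A, [, 1, comma, 2, 6.
const buggyClass = /^[A[1,2,6]$/;
console.log(buggyClass.test(",")); // true
console.log(buggyClass.test("[")); // true
console.log(buggyClass.test("A1")); // false (two characters, the class matches only one)
```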
I'm a regular expression newbie and I can't quite figure out how to write a single regular expression that would "match" any duplicate consecutive words such as:
Paris in the the spring.
Not that that is related.
Why are you laughing? Are my my regular expressions THAT bad??
Is there a single regular expression that will match ALL of the bold strings above?
Try this regular expression:
\b(\w+)\s+\1\b
Here \b is a word boundary and \1 references the captured match of the first group.
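A quick way to try this pattern against the three sentences from the question, sketched in JavaScript (the g flag and the variable names are my additions):

```javascript
const doubled = /\b(\w+)\s+\1\b/g;

const samples = [
  "Paris in the the spring.",
  "Not that that is related.",
  "Why are you laughing? Are my my regular expressions THAT bad??",
];

// One array of matches per sentence.
const found = samples.map((s) => s.match(doubled));
console.log(found); // [["the the"], ["that that"], ["my my"]]

// The \b anchors stop partial-word hits such as the "is" inside "this".
console.log("this is fine".match(doubled)); // null
```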
I believe this regex handles more situations:
/(\b\S+\b)\s+\b\1\b/
A good selection of test strings can be found here: http://callumacrae.github.com/regex-tuesday/challenge1.html
The below expression should work correctly to find any number of duplicated words. The matching can be case insensitive.
String regex = "\\b(\\w+)(\\s+\\1\\b)+";
Pattern p = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
Matcher m = p.matcher(input);
// Check for subsequences of input that match the compiled pattern
while (m.find()) {
input = input.replaceAll(m.group(0), m.group(1));
}
Sample Input : Goodbye goodbye GooDbYe
Sample Output : Goodbye
Explanation:
The regex expression:
\b : Start of a word boundary
\w+ : One or more word characters
(\s+\1\b)+ : One or more whitespace characters followed by a word that matches the previous word and ends at a word boundary. The + after the group lets it match more than one repetition.
Grouping :
m.group(0) : Shall contain the matched group in above case Goodbye goodbye GooDbYe
m.group(1) : Shall contain the first word of the matched pattern in above case Goodbye
Replace method shall replace all consecutive matched words with the first instance of the word.
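The same regex can also be applied in a single replace call with a $1 backreference instead of a find/replaceAll loop. This is a JavaScript sketch of that alternative, not the Java code above; the sample string is extended with "and and more" purely for illustration.

```javascript
// g: replace every run of repeats; i: compare words case-insensitively.
const repeatedWord = /\b(\w+)(\s+\1\b)+/gi;

const input = "Goodbye goodbye GooDbYe and and more";

// Each run of repeated words collapses to its first occurrence (group 1).
const output = input.replace(repeatedWord, "$1");
console.log(output); // Goodbye and more
```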
Try this with the RE below:
\b start of word (word boundary)
\w+ one or more word characters (the captured word)
\W+ one or more non-word characters separating the repeats
\1 the same word matched already
\b end of word
()* repeating again
public static void main(String[] args) {
String regex = "\\b(\\w+)(\\b\\W+\\b\\1\\b)*"; // a word followed by any number of repeats of it
Pattern p = Pattern.compile(regex, Pattern.CASE_INSENSITIVE); // ignore case when comparing words
Scanner in = new Scanner(System.in);
int numSentences = Integer.parseInt(in.nextLine());
while (numSentences-- > 0) {
String input = in.nextLine();
Matcher m = p.matcher(input);
// Check for subsequences of input that match the compiled pattern
while (m.find()) {
input = input.replaceAll(m.group(0),m.group(1));
}
// Prints the modified sentence.
System.out.println(input);
}
in.close();
}
Regex to Strip 2+ duplicate words (consecutive/non-consecutive words)
Try this regex that can catch 2 or more duplicate words and only leave behind one single word. And the duplicate words need not even be consecutive.
/\b(\w+)\b(?=.*?\b\1\b)/ig
Here, \b is used for Word Boundary, ?= is used for positive lookahead, and \1 is used for back-referencing.
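To see which words this pattern actually picks, here is a small JavaScript sketch that lists the matches instead of replacing them. The sample sentence and variable names are made up for illustration.

```javascript
// Matches a word only if the same word (case-insensitively) appears again later.
const repeatedLater = /\b(\w+)\b(?=.*?\b\1\b)/ig;

const sentence = "The cat saw the dog and the cat";

// Every occurrence except the last one of each repeated word is matched.
const hits = [...sentence.matchAll(repeatedLater)].map((m) => m[0]);
console.log(hits); // ["The", "cat", "the"]
```

So replacing the matches with an empty string keeps the last occurrence of each word; the surrounding whitespace would still need tidying afterwards.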
The widely used PCRE library can handle such situations (you won't achieve the same with POSIX-compliant regex engines, though):
(\b\w+\b)\W+\1
Here is one that catches multiple words multiple times:
(\b\w+\b)(\s+\1)+
No. That is an irregular grammar. There may be engine-/language-specific regular expressions that you can use, but there is no universal regular expression that can do that.
This is the regex I use to remove duplicate phrases in my twitch bot:
(\S+\s*)\1{2,}
(\S+\s*) looks for any string of non-whitespace characters, followed by optional whitespace.
\1{2,} then requires at least 2 more consecutive copies of that phrase. In other words, it matches once there are 3 or more identical phrases in a row.
Since some developers are coming to this page in search of a solution which not only eliminates duplicate consecutive non-whitespace substrings, but triplicates and beyond, I'll show the adapted pattern.
Pattern: /(\b\S+)(?:\s+\1\b)+/
Replace: $1 (replaces the full-string match with capture group #1)
This pattern greedily matches a "whole" non-whitespace substring, then requires one or more copies of the matched substring which may be delimited by one or more whitespace characters (space, tab, newline, etc).
Specifically:
\b (word boundary) characters are vital to ensure partial words are not matched.
The second parenthetical is a non-capturing group, because this variable width substring does not need to be captured -- only matched/absorbed.
the + (one or more quantifier) on the non-capturing group is more appropriate than * because * will "bother" the regex engine to capture and replace singleton occurrences -- this is wasteful pattern design.
*note if you are dealing with sentences or input strings with punctuation, then the pattern will need to be further refined.
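A minimal JavaScript sketch of the pattern and replacement described above (the g flag, sample string and names are my own additions):

```javascript
// Capture a whole non-whitespace run, then require one or more whitespace-separated copies.
const repeats = /(\b\S+)(?:\s+\1\b)+/g;

const noisy = "here here here is the result result";

// Each run of duplicates, triplicates, etc. collapses to a single copy.
const cleaned = noisy.replace(repeats, "$1");
console.log(cleaned); // here is the result
```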
The example in Javascript: The Good Parts can be adapted to do this:
var doubled_words = /([A-Za-z\u00C0-\u1FFF\u2800-\uFFFD]+)\s+\1(?:\s|$)/gi;
\b uses \w for word boundaries, where \w is equivalent to [0-9A-Z_a-z]. If you don't mind that limitation, the accepted answer is fine.
This expression (inspired from Mike, above) seems to catch all duplicates, triplicates, etc, including the ones at the end of the string, which most of the others don't:
s.replace(/(^|\s+)(\S+)(($|\s+)\2)+/g, "$1$2")
I know the question asked to match duplicates only, but a triplicate is just 2 duplicates next to each other :)
First, I put (^|\s+) to make sure it starts with a full word, otherwise "child's steak" would go to "child'steak" (the "s"'s would match). Then, it matches all full words ((\b\S+\b)), followed by an end of string ($) or a number of spaces (\s+), the whole repeated more than once.
I tried it like this and it worked well:
var s = "here here here here is ahi-ahi ahi-ahi ahi-ahi joe's joe's joe's joe's joe's the result result result";
print( s.replace( /(\b\S+\b)(($|\s+)\1)+/g, "$1"))
--> here is ahi-ahi joe's the result
Try this regular expression; it handles all repeated-word cases:
\b(\w+)\s+\1(?:\s+\1)*\b
I think another solution would be to use named capture groups and backreferences like this:
/.* (?<mytoken>\w+)\s+\k<mytoken> .*/
OR
/.*(?<mytoken>\w{3,}).+\k<mytoken>.*/
Kotlin:
val regex = Regex(""".* (?<myToken>\w+)\s+\k<myToken> .*""")
val input = "This is a test test data"
val result = regex.find(input)
println(result!!.groups["myToken"]!!.value)
Java:
var pattern = Pattern.compile(".* (?<myToken>\\w+)\\s+\\k<myToken> .*");
var matcher = pattern.matcher("This is a test test data");
var isFound = matcher.find();
var result = matcher.group("myToken");
System.out.println(result);
JavaScript:
const regex = /.* (?<myToken>\w+)\s+\k<myToken> .*/;
const input = "This is a test test data";
const result = regex.exec(input);
console.log(result.groups.myToken);
// OR
const regex = /.* (?<myToken>\w+)\s+\k<myToken> .*/g;
const input = "This is a test test data";
const result = [...input.matchAll(regex)];
console.log(result[0].groups.myToken);
All the above detect the test as the duplicate word.
Tested with Kotlin 1.7.0-Beta, Java 11, Chrome and Firefox 100.
You can use this pattern:
\b(\w+)(?:\W+\1\b)+
This pattern can be used to match all duplicated word groups in sentences. :)
Here is a sample utility function written in Java 17, which replaces all duplications with the first occurrence:
public String removeDuplicates(String input) {
var regex = "\\b(\\w+)(?:\\W+\\1\\b)+";
var pattern = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
var matcher = pattern.matcher(input);
while (matcher.find()) {
input = input.replaceAll(matcher.group(), matcher.group(1));
}
return input;
}
As far as I can see, none of these would match:
London in the
the winter (with the winter on a new line )
Although matching duplicates on the same line is fairly straightforward,
I haven't been able to come up with a solution for the situation in which they
stretch over two lines. ( with Perl )
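One observation that may help, sketched in JavaScript rather than Perl: \s also matches a newline, so when the whole text is held in one string, the basic pattern does span the line break. My assumption is that the difficulty arises when input is processed one line at a time, in which case the two lines would need to be joined first.

```javascript
// Both lines in a single string, separated by a newline.
const twoLines = "London in the\nthe winter";

// \s+ matches the newline, so the duplicate is found across the line break.
const crossLine = twoLines.match(/\b(\w+)\s+\1\b/);
console.log(crossLine[1]); // the
console.log(JSON.stringify(crossLine[0])); // "the\nthe"
```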
To find duplicate words that have no leading or trailing non whitespace character(s) other than a word character(s), you can use whitespace boundaries on the left and on the right making use of lookarounds.
The pattern will have a match in:
Paris in the the spring.
Not that that is related.
The pattern will not have a match in:
This is $word word
(?<!\S)(\w+)\s+\1(?!\S)
Explanation
(?<!\S) Negative lookbehind, assert not a non whitespace char to the left of the current location
(\w+) Capture group 1, match 1 or more word characters
\s+ Match 1 or more whitespace characters (note that this can also match a newline)
\1 Backreference to match the same as in group 1
(?!\S) Negative lookahead, assert not a non whitespace char to the right of the current location
To find 2 or more duplicate words:
(?<!\S)(\w+)(?:\s+\1)+(?!\S)
This part of the pattern (?:\s+\1)+ uses a non capture group to repeat 1 or more times matching 1 or more whitespace characters followed by the backreference to match the same as in group 1.
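Here is a small JavaScript sketch of the lookaround version, assuming an engine with lookbehind support (modern JavaScript has it). The third sample string and the variable names are my own additions.

```javascript
// Whitespace boundaries on both sides, one or more repeats of the captured word.
const wsBounded = /(?<!\S)(\w+)(?:\s+\1)+(?!\S)/g;

console.log("Paris in the the spring.".match(wsBounded)); // ["the the"]
console.log("so so so good".match(wsBounded)); // ["so so so"]

// "$word" has a non-whitespace "$" in front, so "word word" is not reported here.
console.log("This is $word word".match(wsBounded)); // null
```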
Alternatives without using lookarounds
You could also make use of a leading and trailing alternation matching either a whitespace char or assert the start/end of the string.
Then use a capture group 1 for the value that you want to get, and use a second capture group with a backreference \2 to match the repeated word.
Matching 2 duplicate words:
(?:\s|^)((\w+)\s+\2)(?:\s|$)
Matching 2 or more duplicate words:
(?:\s|^)((\w+)(?:\s+\2)+)(?:\s|$)
Use this in case you want case-insensitive checking for duplicate words.
(?i)\\b(\\w+)\\s+\\1\\b
I need to find attribute values in an ASPX file using regular expressions.
That means you don't need to worry about malformed HTML or any HTML related issues.
I need to find the value of a particular attribute (LocText). I want to get what's inside the quotes.
Any ASPX tags such as <%=, <%#, <%$ etc. inside the value don't make sense for this attribute therefore are considered as part of it.
The regex I began with looks like this:
LocText="([^"]+)"
This works great: the first group, which is the result text, gets everything except the double quotes, which are not allowed there (&quot; must be used instead).
But the ASPX file also allows single quotes, in which case a second regular expression must be applied:
LocText='([^']+)'
I could use these two regular expressions but I'm looking for a way to connect them.
LocText=("([^"]+)"|'([^']+)')
This also works but doesn't seem very efficient, as it creates an unnecessary number of groups. I think this could somehow be done using backreferences, but I can't get it to work.
LocText=(["']{1})([^\1]+)\1
I thought that by this, I save the single/double quote to the first group and then I tell it to read anything that is NOT the char found in the first group. This is enclosed again by the quote from the first group. Obviously, I'm wrong and it's not working like that.
Is there any way to combine the first two expressions while creating a minimal number of groups, with one group being the value of the attribute I want to get? Is it possible to use a backreference for the single/double quote, or have I completely misunderstood what backreferences mean?
I'd say your solution with alternation isn't that bad, but you could use named captures so the result will always be found in the same group's value:
Regex regexObj = new Regex(@"LocText=(?:""(?<attr>[^""]+)""|'(?<attr>[^']+)')");
resultString = regexObj.Match(subjectString).Groups["attr"].Value;
Explanation:
LocText= # Match LocText=
(?: # Either match
"(?<attr>[^"]+)" # "...", capture in named group <attr>
| # or match
'(?<attr>[^']+)' # '...', also capture in named group <attr>
) # End of alternation
Another option would be to use lookahead assertions ([^\1] isn't working because you can't place backreferences inside a character class, but you can use them in lookarounds):
Regex regexObj = new Regex(@"LocText=([""'])((?:(?!\1).)*)\1");
resultString = regexObj.Match(subjectString).Groups[2].Value;
Explanation:
LocText= # Match LocText=
(["']) # Match and capture (group 1) " or '
( # Match and capture (group 2)...
(?: # Try to match...
(?!\1) # (unless it's the quote character we matched before)
. # any character
)* # repeat any number of times
) # End of capturing group 2
\1 # Match the previous quote character
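The lookahead approach carries over to other engines as well. Below is a JavaScript sketch (not C#) of the same pattern; the asp:Label tags are hypothetical sample input made up for this example.

```javascript
// Group 1: the opening quote. Group 2: everything up to the same quote character.
const locTextRE = /LocText=(["'])((?:(?!\1).)*)\1/;

// Hypothetical sample tags, one per quoting style.
const doubleQuoted = `<asp:Label LocText="It's here" />`;
const singleQuoted = `<asp:Label LocText='Say "hi"' />`;

// The other kind of quote is allowed inside the value.
const fromDouble = doubleQuoted.match(locTextRE)[2];
const fromSingle = singleQuoted.match(locTextRE)[2];
console.log(fromDouble); // It's here
console.log(fromSingle); // Say "hi"
```

Either way the attribute value ends up in group 2, regardless of which quote character was used.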