I have got an expression – ]006IRBTS1[ g600 niT erauqS ehcoirB g004 g001 /p 57.01$ hcnuB /p 51.2$
I want to extract the portion in bold. The logic is:
Start with “]“.
Take everything until you get to “[“ including “[“.
Include the next 10 characters/digits whatever it is.
After those 10 characters/digits, include all letters and white spaces
until you hit a digit. Capture the digit and everything that follows until you hit a whitespace.
I am using the following regular expression in R. It doesn’t work of course. Any thoughts?
"^].+\\[.{10}[A-Za-z\\s]+[0-9\\.]+\\s"
1) Start with “]“.
\]
2) Take everything until you get to “[“ including “[“.
[^\[]+\[
3) Include the next 10 characters/digits whatever it is.
.{,10}
4) After those 10 characters/digits, include all letters and white spaces until you hit a digit.
[a-zA-Z\s]+\d
5) Capture the digit and everything that follows until you hit a whitespace.
[^\s]+
Combined:
\][^\[]+\[.{,10}[a-zA-Z\s]+\d[^\s]+
Regex101: https://regex101.com/r/TpoV52/1
UPDATE
I changed the very last quantifier from + to * so that it matches zero or more further characters.
That is because, given "Capture the digit and everything that follows until you hit a whitespace", a whitespace may follow immediately after that digit. This is the case in the 2nd subject string you gave in your comment:
]006IRBTS1[ g600 niT erauqS ehcoirB g4 g001 /p 57.01$ hcnuB /p 51.2$
The updated pattern, below, will stop at the "capture that digit" (g4) because "and everything that follows until you hit a whitespace" is actually nothing. (Whitespace is next char after digit.)
\][^\[]+\[.{,10}[a-zA-Z\s]+\d[^\s]*
Regex101: https://regex101.com/r/TpoV52/2
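As a rough check, the updated pattern can be run against both subject strings. This is only a sketch in Python's re module rather than R, which is an assumption on my part, and the variable names are made up for illustration. Python reads {,10} as {0,10}; other engines may treat that quantifier differently, so .{10} is the safer spelling where exactly ten characters are wanted.

```python
import re

# Updated pattern from the answer, unchanged. Python reads {,10} as {0,10}.
PATTERN = re.compile(r"\][^\[]+\[.{,10}[a-zA-Z\s]+\d[^\s]*")

first = "]006IRBTS1[ g600 niT erauqS ehcoirB g004 g001 /p 57.01$ hcnuB /p 51.2$"
second = "]006IRBTS1[ g600 niT erauqS ehcoirB g4 g001 /p 57.01$ hcnuB /p 51.2$"

first_match = PATTERN.search(first).group(0)
second_match = PATTERN.search(second).group(0)

print(first_match)   # ]006IRBTS1[ g600 niT erauqS ehcoirB g004
print(second_match)  # ]006IRBTS1[ g600 niT erauqS ehcoirB g4
```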
Related
I typically skip unwanted text by using (*SKIP)(*F) in my outer-most group:
/(?x)
# Match "quoted text" except when immediately followed by colon:
"(?>[^\\"]+|\\(?s:.))*"[:](*SKIP)(*F)|
"(?>[^\\"]+|\\(?s:.))*"
/
Example
When I move (*SKIP)(*F) into a group, this acts differently:
/(?x)
(?&Text)
(?(DEFINE)
# Match "quoted text" except when immediately followed by colon:
(?<Text>
"(?>[^\\"]+|\\(?s:.))*":(*SKIP)(*F)
|"(?>[^\\"]+|\\(?s:.))*"
)
)/
Example
In the first example, the (*SKIP)(*F) moves the bumpalong to the point in the string after the colon (according to the regex debugger) before trying again. In the second example, the bumpalong appears to only move onto the next character in the string.
What causes this logic to differ?
Update:
Moving the (*SKIP)(*F) block into a DEFINE group appears to work as expected in perl 5.24. It moves the bumpalong the same as if (*SKIP)(*F) were encountered in the outer-most group.
Could this be a bug in the PCRE/PCRE2 implementation?
I'm trying to understand a regular expression someone has written in the gsub() function.
I've never used regular expressions before seeing this code. I have tried to work out how it gets the final result with some googling, but I have hit a wall, so to speak.
gsub('.*(.{2}$)', '\\1',"my big fluffy cat")
This code returns the last two characters of the given string; in the example above it returns "at". That is the expected result, but from my brief foray into regular expressions I don't understand why the code does what it does.
What I understand is the '.*' means look for any character 0 or more times. So it's going to look at the entire string and this is what will be replaced.
The part in brackets looks for any two characters at the end of the string. It would make more sense to me if this part in brackets were in place of the '\1'. To me it would then read: look at the entire string and replace it with the last two characters of that string.
All that does, though, is output the actual code as the replacement, e.g. ".{2}$".
Finally, I don't understand why '\1' is in the replacement part of the function. To me this just says: replace the entire string with a single backslash and the number one. I say a single backslash because it's my understanding that the first backslash is only there to make the second backslash a non-special character.
There are two ways of using gsub. The most common is probably:
gsub("-","TEST","This is a - ")
which would return
This is a TEST
This simply finds the matches of the regular expression and replaces each of them with the replacement string.
The second way to use gsub is the one you described, using \\1, \\2 or \\3...
These refer to the first, second or third capture group in your regular expression.
A capture group is defined by anything inside round brackets, e.g. (capture_group_1)(capture_group_2)...
Explanation
Your analysis is correct.
What I understand is the '.*' means look for any character 0 or more times. So it's going to look at the entire string and this is what will be replaced.
The part in brackets looks for any two characters at the end of the string
The last two characters are placed in a capture group, and we simply replace the whole string with this capture group. We are not replacing those two characters with anything.
if it helps, check out the result of this expression.
gsub('(.*)(.{2}$)', 'Group 1: \\1, Group 2: \\2',"my big fluffy cat")
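For comparison, here is a small sketch of the same two substitutions using Python's re.sub; the choice of Python is mine, not the asker's. In R the replacement is written '\\1' because the string literal needs the backslash escaped, while a Python raw string r"\1" passes the same two characters to the regex engine.

```python
import re

text = "my big fluffy cat"

# .* consumes everything except the final two characters, which group 1 captures.
last_two = re.sub(r".*(.{2}$)", r"\1", text)

# Two groups make the split between the two parts visible.
labelled = re.sub(r"(.*)(.{2}$)", r"Group 1: \1, Group 2: \2", text)

print(last_two)  # at
print(labelled)  # Group 1: my big fluffy c, Group 2: at
```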
I hope these examples help you understand it better.
Say we have the string foobarabcabcdef.
.* matches the whole string.
.*abc matches any characters from the beginning up to the last abc (greedy matching); thus it matches foobarabcabc.
.*(...)$ matches the whole string as well, but the last 3 characters are grouped. Even without the (), the match has a default group, group 0; each () adds group 1, 2, 3, and so on. Consider .*(...)(...)(...)$, which gives:
group 0 : whole string
group 1 : "abc" the first "abc"
group 2 : "abc" the 2nd "abc"
group 3 : "def" the last 3 chars
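The group numbering listed above can be inspected directly. This is a minimal sketch in Python (my choice of language; the variable names are illustrative):

```python
import re

subject = "foobarabcabcdef"
match = re.match(r".*(...)(...)(...)$", subject)

# group(0) is the whole match; groups 1 to 3 are the three captured triples.
groups = [match.group(i) for i in range(4)]
print(groups)  # ['foobarabcabcdef', 'abc', 'abc', 'def']

# Replacing the whole match with a reference to group 3 keeps only 'def'.
only_last = re.sub(r".*(...)(...)(...)$", r"\3", subject)
print(only_last)  # def
```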
So back to your example: \\1 is a reference to a group. What it does is "replace the whole match with the text matched by group 1". That is, whatever .{2}$ matched becomes the replacement.
If you don't understand the backslashes, you will have to consult the R syntax; I cannot say more. It is all about escaping.
The important part of that regular expression is the brackets, which form something called a "capturing group".
The regular expression .*(.{2}$) says: match anything and capture the last 2 characters of the line. The replacement \\1 refers to that group, so the whole match is replaced with the captured group, which is the last two characters in this case.
I want to introduce a backspace character at the beginning of the line where a particular pattern is not found. Please advise.
Thanks,
Sagar
If you mean that you want to "remove the first character" then you can do this:
1) Write your regex pattern of what you want to find. For example, if you want to match Remove me at the start of the line, use:
^R\(emove me\)
Here we use ^ to anchor the match at the start of the line. We also capture the part of the match that we wish to keep in a backreference so it can be used later.
2) Replace each match with whatever was captured in the backreference, in this case emove me, in effect backspacing the first character.
3) Make sure regular expression is checked and the cursor is at the start of the file, and hit Replace All.
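The same find-and-replace idea can be sketched outside an editor. The snippet below uses Python as an assumption; note that Python writes capture groups with plain parentheses, whereas the pattern above uses the escaped \( \) form. The sample text is made up for illustration.

```python
import re

# Hypothetical sample input: two lines start with the text to trim.
text = "Remove me\nKeep me\nRemove me too"

# re.MULTILINE makes ^ match at the start of every line, not just the string.
result = re.sub(r"^R(emove me)", r"\1", text, flags=re.MULTILINE)

print(result)
# emove me
# Keep me
# emove me too
```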
I have a regular expression
^[a-zA-Z+#-.0-9]{1,5}$
which validates that a word contains only alphanumeric characters and a few special characters, and that it is no more than 5 characters long.
How do I make this regular expression accept a maximum of five words, each matching the above regular expression?
^[a-zA-Z+#\-.0-9]{1,5}(\s[a-zA-Z+#\-.0-9]{1,5}){0,4}$
Also, you could use for example [ ] instead of \s if you just want to accept space, not tab and newline. And you could write [ ]+ (or \s+) for any number of spaces (or whitespaces), not just one.
Edit: Removed the invalid solution and fixed the bug mentioned by unicornaddict.
I believe this may be what you're looking for. It forces at least one word of your desired pattern, then zero to four of the same, each preceded by one or more white-space characters:
^XX(\s+XX){0,4}$
where XX is your actual one-word regex.
It's separated into two distinct sections so that you're not required to have white-space at the end of the string. If you want to allow for such white-space, simply add \s* at that point. For example, allowing white-space both at start and end would be:
^\s*XX(\s+XX){0,4}\s*$
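To make the XX placeholder concrete, here is a rough sketch that substitutes the one-word pattern into the template. It is written in Python as an assumption, and the names WORD and UP_TO_FIVE are illustrative only.

```python
import re

# One-word pattern with the hyphen escaped so it is taken literally.
WORD = r"[a-zA-Z+#\-.0-9]{1,5}"

# One word, then zero to four more words, each preceded by whitespace.
UP_TO_FIVE = re.compile(rf"^{WORD}(\s+{WORD}){{0,4}}$")

five_words = bool(UP_TO_FIVE.match("C# .NET C++ a-b x.y"))
six_words = bool(UP_TO_FIVE.match("a b c d e f"))
long_word = bool(UP_TO_FIVE.match("toolong"))
trailing_space = bool(UP_TO_FIVE.match("one two "))

print(five_words, six_words, long_word, trailing_space)
# True False False False
```

Building the pattern from a single WORD definition keeps the two copies of the word regex from drifting apart.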
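Your regex has a small bug. Inside a character class, a hyphen with a character on each side acts as a range metacharacter, so #-. is read as the range from # to the period. The class therefore matches letters, digits, +, # and the period, but also every other character between # and the period ($, %, &, ', (, ), * and the comma). The hyphen itself only matches because it happens to fall inside that range. To avoid this you'll have to escape the hyphen: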
^[a-zA-Z+#\-.0-9]{1,5}$
Or put it at the beginning or end of the character class, so that it is treated literally:
^[-a-zA-Z+#.0-9]{1,5}$
^[a-zA-Z+#.0-9-]{1,5}$
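The effect of the unescaped hyphen can be checked directly. In this sketch (Python is my assumption; the names are illustrative) the original class accepts a dollar sign because $ lies between # and the period in ASCII, while the escaped version rejects it. The hyphen is accepted by both, since it also falls inside that range.

```python
import re

buggy = re.compile(r"^[a-zA-Z+#-.0-9]{1,5}$")   # '#-.' is a range
fixed = re.compile(r"^[a-zA-Z+#\-.0-9]{1,5}$")  # hyphen is literal

buggy_dollar = bool(buggy.match("a$b"))
fixed_dollar = bool(fixed.match("a$b"))
buggy_hyphen = bool(buggy.match("a-b"))
fixed_hyphen = bool(fixed.match("a-b"))

print(buggy_dollar, fixed_dollar)  # True False
print(buggy_hyphen, fixed_hyphen)  # True True
```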
Now to match a max of 5 such words you can use:
^(?:[a-zA-Z+#\-.0-9]{1,5}\s+){1,5}$
EDIT: This solution has a severe limitation: it matches only input that ends in whitespace. To overcome this limitation, see the answer by Jakob.
I have a question about a regex. Given this part of a regex:
(.[^\\.]+)
The part [^\.]+ Does this mean get everything until the first dot? So with this text:
Hello my name is Martijn. I live in Holland.
I get 2 results: both sentences. But when I leave out the + sign, I get pairs of two characters: He, ll, o<space>, my, etc. Why is that?
Your regex .[^\\.]+ means:
Match any character
Then match one or more characters that are neither a backslash nor a dot ".". Note that [^\\.] is a negated character class: it matches any single character except a backslash or a dot. Because of the "+" at the end, it keeps on matching characters until it finds a dot or a backslash. "+" is a greedy quantifier, so it takes as many characters as it can.
When you input (quotes not included): "Hello my name is Martijn. I live in Holland."
The matches are:
Hello my name is Martijn
. I live in Holland
Note that the dot is not included in the first match since it stops at n in Martijn and the second match starts with the dot.
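The two matches can be reproduced with a short sketch. Python's re.findall is used here as an assumption about the engine; the pattern is taken at the regex level, where \\ inside the class stands for one literal backslash.

```python
import re

text = "Hello my name is Martijn. I live in Holland."

# Any one character, then one or more characters that are not '\' or '.'.
matches = re.findall(r".[^\\.]+", text)

print(matches)  # ['Hello my name is Martijn', '. I live in Holland']
```

The final dot produces no third match, because nothing follows it for the character class to consume.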
When you remove the +: (.[^\\.])
It just means:
Match any character
Match any character except a dot or a backslash.
Because a dot outside a character class (ie, not between []) means (almost) any character.
So .[^\\.] means: match (almost) any character followed by something which is neither a dot nor a backslash (dots don't need to be escaped in a character class to mean just a dot, but backslashes do).
In your example, this is H (any character) followed by e (neither a dot nor a backslash), and so on.
Whereas with a + (one or more characters that are neither a dot nor a backslash) you will match all characters up to the next dot.
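Dropping the + can be tried the same way. This is again a Python sketch, which is my assumption about the engine; each match is now exactly two characters long.

```python
import re

text = "Hello my name is Martijn. I live in Holland."

# Any one character followed by exactly one character that is not '\' or '.'.
pairs = re.findall(r".[^\\.]", text)

print(pairs[:4])  # ['He', 'll', 'o ', 'my']
print(pairs[12])  # the dot after "Martijn" plus the following space: '. '
```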
The regex means:
any one character followed by one or more characters that are not a backslash or a period.