Capturing Group in R


I have the following pattern, Set(?:Value)?, in R:
grepl('Set(?:Value)?', 'Set(Value)', perl=T)
This pattern is matched by
1- Set
2- Set Value
3- Set(Value)
But I want to match only the first two cases and not the third. Can anybody help me?
Thank you

You can use
grepl('^Set(?:\\s+Value)?$', x)
grepl('\\bSet(?!\\(Value\\))(?:\\s+Value)?\\b', x, perl=TRUE)
See regex demo #1 and regex demo #2.
Details:
^Set(?:\s+Value)?$ - start of string, Set, an optional sequence of one or more whitespaces (\s+) followed by Value, and then end of string
\bSet(?!\(Value\))(?:\s+Value)?\b:
\b - word boundary
Set - Set string
(?!\(Value\)) - no (Value) string allowed at this very location
(?:\s+Value)? - an optional sequence of one or more whitespaces (\s+) and a Value
\b - word boundary
See an R demo:
x <- c("Set", "Set Value", "Set(Value)")
grep('^Set(?:\\s+Value)?$', x, value=TRUE)
## => [1] "Set" "Set Value"
grep('\\bSet(?!\\(Value\\))(?:\\s+Value)?\\b', x, perl=TRUE, value=TRUE)
## => [1] "Set" "Set Value"
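The two patterns behave differently once the text sits inside a longer string: the anchored one accepts only whole strings, while the word-boundary one also finds Set or Set Value inside a sentence. A minimal sketch of that difference, assuming a fourth test string that is invented here and is not part of the question:

```r
# The three strings from the question plus one invented sentence
x <- c("Set", "Set Value", "Set(Value)", "please Set Value now")

# Anchored pattern: the whole string must be Set or Set Value
anchored <- grepl("^Set(?:\\s+Value)?$", x, perl = TRUE)

# Word-boundary pattern: Set may appear anywhere, but not right before (Value)
bounded <- grepl("\\bSet(?!\\(Value\\))(?:\\s+Value)?\\b", x, perl = TRUE)

# Side-by-side view of which strings each pattern accepts
print(data.frame(x, anchored, bounded))
```

Only the word-boundary pattern accepts the invented sentence, so the choice depends on whether the input is a whole token or free text.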

Related

Positive Lookbehind and Lookahead to the end of string

My string pattern looks like this:
UNB+UNOC:3+4399945681577+_GLN_Company__+180101:0050+10870 and I am trying to extract everything after the second last +, i.e. 180101:0050+10870.
Thus far, I managed to address the second last block 180101:0050 with this expression (?<=\+)[^\+]+(?=\+[^\+]*$) but fail to include the last block including the last +. Here is my sample: regex101
The expression is meant for R and I still need to escape the characters later on. This format is just for testing purposes in Regex101.

We could use a capturing group based on the occurrences of + counted from the end ($) of the string.
sub(".*\\+([^+]+\\+[^+]+$)", "\\1", str1)
#[1] "180101:0050+10870"
data
str1 <- "UNB+UNOC:3+4399945681577+_GLN_Company__+180101:0050+10870"

You may use
\+\K[^+]+\+[^+]*$
Or, if you would like to use it with stringr::str_extract:
(?<=\+)[^+]+\+[^+]*$
See the regex demo. Details:
\+ - a + char
\K - match reset operator
(?<=\+) - location right after a + symbol
[^+]+ - one or more chars other than +
\+ - a +
[^+]* - zero or more chars other than +
$ - end of string.
See R demo online:
x <- "UNB+UNOC:3+4399945681577+_GLN_Company__+180101:0050+10870"
regmatches(x, regexpr("\\+\\K[^+]+\\+[^+]*$", x, perl=TRUE))
## => [1] "180101:0050+10870"
library(stringr)
str_extract(x, "(?<=\\+)[^+]+\\+[^+]*$")
## => [1] "180101:0050+10870"

Another way to do it in this case:
library(stringr)
str_extract("UNB+UNOC:3+4399945681577+_GLN_Company__+180101:0050+10870", "\\d+:\\d+\\+\\d+")
#"180101:0050+10870"
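If a regex is not required, the same result can probably be reached by splitting on the + sign and keeping the last two fields. This is a different technique from the answers above, sketched here under the assumption that the string always contains at least two + characters:

```r
x <- "UNB+UNOC:3+4399945681577+_GLN_Company__+180101:0050+10870"

# Split on literal + signs (fixed = TRUE disables regex interpretation)
parts <- strsplit(x, "+", fixed = TRUE)[[1]]

# Keep the last two fields and glue them back together with +
last_two <- paste(tail(parts, 2), collapse = "+")
last_two
## => [1] "180101:0050+10870"
```

Unlike the \\d+:\\d+\\+\\d+ pattern, this does not depend on the fields being numeric.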

Extract substring between two colons / special characters

I want to extract "SUBSTRING" with sub() from the following string:
attribute <- "S4Q7b1_t1_r1: SUBSTRING: some explanation: some explanation - ..."
I used the following code, but unfortunately it didn't work:
sub(".*: (.*) : .*", "\\1", attribute)
Does anyone know an answer for that?

You may use
sub("^[^:]*: ([^:]*).*", "\\1", attribute)
See the regex demo
You need to rely on negated character classes, [^:] that matches any char but :, since .* matches greedily any 0 or more chars. Also, your pattern contains a space before : and it is missing in the string.
Details
^ - start of string
[^:]* - any 0+ chars other than :
: - a colon with a space
([^:]*) - Capturing group 1 (\1 refers to this value): any 0+ chars other than :
.* - the rest of the string.
R Demo:
attribute <- "S4Q7b1_t1_r1: SUBSTRING: some explanation: some explanation - ..."
sub("^[^:]*: ([^:]*).*", "\\1", attribute)
## => [1] "SUBSTRING"
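To see why the original attempt fails, it may help to run the variants side by side. The sketch below uses perl = TRUE throughout so that the matching semantics are those of PCRE; the variable names and the strsplit alternative at the end are additions made here, not part of the answer:

```r
attribute <- "S4Q7b1_t1_r1: SUBSTRING: some explanation: some explanation - ..."

# Original attempt: " : " (space, colon, space) never occurs in the string,
# so nothing is replaced and the input comes back unchanged
unchanged <- sub(".*: (.*) : .*", "\\1", attribute, perl = TRUE)

# Removing the stray space is not enough: the greedy .* runs ahead and the
# group captures a later field instead of the first one
greedy <- sub(".*: (.*): .*", "\\1", attribute, perl = TRUE)

# Negated character classes stop at the first colon
wanted <- sub("^[^:]*: ([^:]*).*", "\\1", attribute, perl = TRUE)

# Non-regex alternative: split on ": " and take the second field
by_split <- strsplit(attribute, ": ", fixed = TRUE)[[1]][2]

print(c(greedy = greedy, wanted = wanted, by_split = by_split))
```

Here greedy comes out as "some explanation", while wanted and by_split both give "SUBSTRING".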

Remove hashtags from beginning and end of tweets in R

I am trying to remove hashtags from the end of strings in R.
For example:
x<- "I didn't know it could be #boring. guess I need some fun #movie #lateNightThoughts"
I want to remove the hashtags at the end of string which are #lateNightThoughts and #movie. Result:
- "I didn't know it could be #boring. guess I need some fun"
I tried :
stringi::stri_replace_last_regex(x,'#\\S+',"")
but it removes only the very last hashtag.
- "I didn't know it could be #boring. guess I need some fun #movie "
Any idea how to get the expected result?
Edit:
How about removing a hashtag from the beginning of the text?
eg:
x<- "#Thomas20 I didn't know it could be #boring. guess I need some fun #movie #lateNightThoughts"

You may use
> x<- "I didn't know it could be #boring. guess I need some fun #movie #lateNightThoughts"
> sub("\\s*\\B#\\w+(?:\\s*#\\w+)*\\s*$", "", x)
[1] "I didn't know it could be #boring. guess I need some fun"
Or, if you do not care about the context of the first # you want to start matching from, you may even use
sub("(?:\\s*#\\w+)+\\s*$", "", x)
See the regex demo.
Details
\s* - zero or more whitespaces
\B - right before the current location, there can be start of string or a non-word char (this is usually used to ensure you do not match # inside a "word", so if you do not need it, you may remove this non-word boundary)
# - a # char
\w+ - 1 or more word chars (letters, digits or _)
(?:\s*#\w+)* - zero or more occurrences of:
\s* - zero or more whitespaces
# - a # char
\w+ - 1+ word chars
\s* - zero or more whitespaces
$ - end of string.
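The edit in the question also asks about hashtags at the beginning of the text, which the patterns above do not cover. A possible sketch mirrors the trailing pattern with a ^ anchor. The helper name strip_edge_hashtags is invented for illustration, and perl = TRUE is an assumption made here to keep both calls on the PCRE engine:

```r
# Hypothetical helper (name invented here): drop runs of hashtags at both
# ends of a string, leaving hashtags in the middle untouched
strip_edge_hashtags <- function(x) {
  # Trailing run: one or more hashtags, then optional whitespace, then end
  x <- sub("(?:\\s*#\\w+)+\\s*$", "", x, perl = TRUE)
  # Leading run: optional whitespace, then one or more hashtags, each with
  # the whitespace that follows it
  sub("^\\s*(?:#\\w+\\s*)+", "", x, perl = TRUE)
}

x <- "#Thomas20 I didn't know it could be #boring. guess I need some fun #movie #lateNightThoughts"
cleaned <- strip_edge_hashtags(x)
cleaned
## => [1] "I didn't know it could be #boring. guess I need some fun"
```

The #boring hashtag survives because it is neither at the start nor part of the run that ends the string.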

matching start of a string but not end in R

How can I match all words starting with plan_ and not ending with template without using invert = TRUE? In the below example, I'd like to match only the second string. I tried with negative lookahead but it does not work, maybe because of greediness?
names <- c("plan_x_template", "plan_x")
grep("^plan.*(?!template)$",
names,
value = TRUE, perl = TRUE
)
#> [1] "plan_x_template" "plan_x"
I mean one can also solve the problem with two regex calls but I'd like to see how it works the other way :-)
is_plan <- grepl("^plan_", names)
is_template <- grepl("_template$", names)
names[is_plan & !is_template]
#> [1] "plan_x"

You may use
names <- c("plan_x_template", "plan_x")
grep("^plan(?!.*template)",
names,
value = TRUE, perl = TRUE
)
See the R online demo
The ^plan(?!.*template) pattern matches:
^ - a start of string
plan - a plan substring
(?!.*template) - a negative lookahead that fails the match if, immediately to the right of the current location, there are 0+ chars other than line break chars, as many as possible, followed with a template substring (since perl = TRUE is used and the pattern is processed with the PCRE engine, . does not match line break chars, unlike in the default TRE regex engine).
NOTE: In case of multiline strings, you need to use a DOTALL modifier in the regex, "(?s)^plan(?!.*template)".
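Note that ^plan(?!.*template) rejects template anywhere after plan, not only at the end. The attempt in the question, ^plan.*(?!template)$, fails for a different reason: .* consumes the whole string, and at the end of the string the lookahead is trivially satisfied. If only a trailing template should be excluded, one option is to put $ inside the lookahead. A minimal sketch, assuming a third test name that is invented here:

```r
# Two names from the question plus one invented name with template in the middle
nms <- c("plan_x_template", "plan_x", "plan_template_x")

# Pattern from the answer: rejects template anywhere after plan,
# so only "plan_x" is kept
strict <- grep("^plan(?!.*template)", nms, value = TRUE, perl = TRUE)

# Variant with $ inside the lookahead: rejects template only at the very end,
# so "plan_x" and "plan_template_x" are kept
end_only <- grep("^plan_(?!.*template$)", nms, value = TRUE, perl = TRUE)

print(strict)
print(end_only)
```

Which of the two is appropriate depends on whether names such as plan_template_x can occur in the data.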

Extract digits after matching the certain string second time

I want to extract the digits after the second occurrence of an underscore (_) in a string.
Following the similar posts here
Matching different digits after a lookahead
regex - return all before the second occurrence
I tried
library(stringr)
pattern <- c("1/2/3_500k/855kk_1400k/AVBB")
str_extract(pattern, "(^_){2}(\\d+\\.*\\d*)")
which outputs
[1] NA
instead of 1400. Could you help?

You may use a base R solution with regexpr/regmatches:
regmatches(x, regexpr("^(?:[^_]*_){2}[^_0-9]*\\K\\d*\\.?\\d+", x, perl=TRUE))
Or, with sub:
sub("^(?:[^_]*_){2}[^_0-9]*(\\d*\\.?\\d+).*", "\\1", x)
See the R demo online.
The regex is
^(?:[^_]*_){2}[^_0-9]*\K\d*\.?\d+
See the online regex demo.
Details
^ - start of string
(?:[^_]*_){2} - 2 repetitions of
[^_]* - any 0+ chars other than _
_ - an underscore
[^_0-9]* - any 0+ chars other than _ and digits
\K - match reset operator discarding all text matched so far
\d*\.?\d+ - a float or integer number pattern (0+ digits, an optional . and then 1+ digits).
In the sub regex variation, the \K is not necessary, the number pattern is captured into a capturing group and the rest of string is matched with .* pattern. The result is the contents of Group 1, referred to with the \1 placeholder.
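The snippets above refer to x, which is not defined in the question (the string is called pattern there). A self-contained sketch that runs both variants, with perl = TRUE added to the sub call as an assumption to keep both on the PCRE engine:

```r
# x is the string from the question (named pattern there)
x <- "1/2/3_500k/855kk_1400k/AVBB"

# regmatches/regexpr: \K drops the prefix, so only the number is returned
num_match <- regmatches(
  x,
  regexpr("^(?:[^_]*_){2}[^_0-9]*\\K\\d*\\.?\\d+", x, perl = TRUE)
)

# sub: the number is captured in group 1 and the rest of the string is consumed
num_sub <- sub("^(?:[^_]*_){2}[^_0-9]*(\\d*\\.?\\d+).*", "\\1", x, perl = TRUE)

# Both give the string "1400"; convert it if a number is needed
as.numeric(num_match)
## => [1] 1400
```

One difference worth keeping in mind: when nothing matches, regmatches returns an empty vector, whereas sub returns the input unchanged.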

One option could be:
pattern <- c("1/2/3_500k/855kk_1400k/AVBB")
sub(".*_.*_(\\d+).*","\\1", pattern, perl = TRUE)
[1] "1400"
The regex is:
".*_.*_(\\d+).*"
Details:
.*_ anything before first _
.*_ anything after first _ and before 2nd _
\\d+ look for digits and take those as selection.
.* anything afterwards.
\\1 replaces matching strings with values found for 1st group.
