Capturing Group in R

I have the following pattern, Set(?:Value)?, in R:
grepl('Set(?:Value)?', 'Set(Value)', perl=T)
This pattern matches all of the following:
1- Set
2- Set Value
3- Set(Value)
But I want it to match only the first two cases and not the third. Can anybody help me?
Thank you

You can use
grepl('^Set(?:\\s+Value)?$', x)
grepl('\\bSet(?!\\(Value\\))(?:\\s+Value)?\\b', x, perl=TRUE)
See regex demo #1 and regex demo #2.
Details:
^Set(?:\\s+Value)?$ - start of string, Set, an optional sequence of one or more whitespaces (\s+) and a Value and then end of string
\bSet(?!\(Value\))(?:\s+Value)?\b:
\b - word boundary
Set - Set string
(?!\(Value\)) - no (Value) string allowed at this very location
(?:\s+Value)? - an optional sequence of one or more whitespaces (\s+) and a Value
\b - word boundary
See an R demo:
x <- c("Set", "Set Value", "Set(Value)")
grep('^Set(?:\\s+Value)?$', x, value=TRUE)
## => [1] "Set" "Set Value"
grep('\\bSet(?!\\(Value\\))(?:\\s+Value)?\\b', x, perl=TRUE, value=TRUE)
## => [1] "Set" "Set Value"
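To see why the original pattern also accepts Set(Value), it can help to print what the regex actually matches. The following is a minimal sketch (the variable names matched and anchored are only illustrative): without anchors, the optional group simply matches nothing and the bare Set inside every string is enough for grepl to return TRUE.

```r
x <- c("Set", "Set Value", "Set(Value)")

# Show which substring the original, unanchored pattern actually matches
matched <- regmatches(x, regexpr("Set(?:Value)?", x, perl = TRUE))
print(matched)
## => [1] "Set" "Set" "Set"

# Anchoring both ends forces the whole string to fit the pattern
anchored <- grepl("^Set(?:\\s+Value)?$", x, perl = TRUE)
print(anchored)
## => [1]  TRUE  TRUE FALSE
```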

Related

Positive Lookbehind and Lookahead to the end of string

My string pattern looks like this:
UNB+UNOC:3+4399945681577+_GLN_Company__+180101:0050+10870 and I am trying to extract everything after the second last +, i.e. 180101:0050+10870.
Thus far, I managed to address the second last block 180101:0050 with this expression (?<=\+)[^\+]+(?=\+[^\+]*$) but fail to include the last block including the last +. Here is my sample: regex101
The expression is meant for R and I still need to escape the characters later on. This format is just for testing purposes in Regex101.
We could capture group based on the occurrence of + from the end ($) of the string.
sub(".*\\+([^+]+\\+[^+]+$)", "\\1", str1)
#[1] "180101:0050+10870"
data
str1 <- "UNB+UNOC:3+4399945681577+_GLN_Company__+180101:0050+10870"
You may use
\+\K[^+]+\+[^+]*$
Or, if you would like to use it with stringr::str_extract:
(?<=\+)[^+]+\+[^+]*$
See the regex demo. Details:
\+ - a + char
\K - match reset operator
(?<=\+) - location right after a + symbol
[^+]+ - one or more chars other than +
\+ - a +
[^+]* - zero or more chars other than +
$ - end of string.
See R demo online:
x <- "UNB+UNOC:3+4399945681577+_GLN_Company__+180101:0050+10870"
regmatches(x, regexpr("\\+\\K[^+]+\\+[^+]*$", x, perl=TRUE))
## => [1] "180101:0050+10870"
library(stringr)
str_extract(x, "(?<=\\+)[^+]+\\+[^+]*$")
## => [1] "180101:0050+10870"
Another way to do it in this case:
library(stringr)
str_extract("UNB+UNOC:3+4399945681577+_GLN_Company__+180101:0050+10870", "\\d+:\\d+\\+\\d+")
#"180101:0050+10870"
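As a cross-check that does not rely on a regex at all, the string can be split on the literal + and the last two pieces glued back together. This is only a sketch of an alternative technique, not one of the answers above; the names parts and last_two are illustrative.

```r
x <- "UNB+UNOC:3+4399945681577+_GLN_Company__+180101:0050+10870"

# fixed = TRUE treats "+" as a literal character, so no escaping is needed
parts <- strsplit(x, "+", fixed = TRUE)[[1]]

# Keep the last two fields and restore the separator between them
last_two <- paste(tail(parts, 2), collapse = "+")
print(last_two)
## => [1] "180101:0050+10870"
```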

Extract substring between two colons / special characters

I want to extract "SUBSTRING" with sub() from the following string:
attribute <- "S4Q7b1_t1_r1: SUBSTRING: some explanation: some explanation - ..."
I used the following code, but unfortunately it didn't work:
sub(".*: (.*) : .*", "\\1", attribute)
Does anyone know an answer for that?
You may use
sub("^[^:]*: ([^:]*).*", "\\1", attribute)
See the regex demo
You need to rely on a negated character class, [^:], which matches any char but :, since .* greedily matches any 0 or more chars. Also, your pattern contains a space before the second :, and that space is missing in the string.
Details
^ - start of string
[^:]* - any 0+ chars other than :
: - a colon with a space
([^:]*) - Capturing group 1 (\1 refers to this value): any 0+ chars other than :
.* - the rest of the string.
R Demo:
attribute <- "S4Q7b1_t1_r1: SUBSTRING: some explanation: some explanation - ..."
sub("^[^:]*: ([^:]*).*", "\\1", attribute)
## => [1] "SUBSTRING"
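The two problems named above can be reproduced separately. The sketch below (variable names are illustrative, and perl = TRUE is used only so the backtracking behaviour is the familiar PCRE one) first shows that the original pattern never matches, so sub() returns the input unchanged, and then that removing the extra space is still not enough because the greedy .* runs past the first colon.

```r
attribute <- "S4Q7b1_t1_r1: SUBSTRING: some explanation: some explanation - ..."

# Original attempt: " : " never occurs, so nothing is replaced
unchanged <- sub(".*: (.*) : .*", "\\1", attribute, perl = TRUE)
print(identical(unchanged, attribute))
## => [1] TRUE

# Space removed, but the leading .* is greedy and skips past SUBSTRING
greedy <- sub(".*: (.*): .*", "\\1", attribute, perl = TRUE)
print(greedy)
## => [1] "some explanation"

# Negated character classes cannot cross a colon, so the first field wins
negated <- sub("^[^:]*: ([^:]*).*", "\\1", attribute, perl = TRUE)
print(negated)
## => [1] "SUBSTRING"
```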

Remove hashtags from beginning and end of tweets in R

I am trying to remove hashtags from the end of strings in R.
For example:
x<- "I didn't know it could be #boring. guess I need some fun #movie #lateNightThoughts"
I want to remove the hashtags at the end of string which are #lateNightThoughts and #movie. Result:
- "I didn't know it could be #boring. guess I need some fun"
I tried :
stringi::stri_replace_last_regex(x,'#\\S+',"")
but it removes only the very last hashtag.
- "I didn't know it could be #boring. guess I need some fun #movie "
Any idea how to get the expected result?
Edit:
How about removing hashtag from beginning of text ?
eg:
x<- "#Thomas20 I didn't know it could be #boring. guess I need some fun #movie #lateNightThoughts"
You may use
> x<- "I didn't know it could be #boring. guess I need some fun #movie #lateNightThoughts"
> sub("\\s*\\B#\\w+(?:\\s*#\\w+)*\\s*$", "", x)
[1] "I didn't know it could be #boring. guess I need some fun"
Or, if you do not care about the context of the first # you want to start matching from, you may even use
sub("(?:\\s*#\\w+)+\\s*$", "", x)
See the regex demo.
Details
\s* - zero or more whitespaces
\B - right before the current location, there can be start of string or a non-word char (this is usually used to ensure you do not match # inside a "word", so if you do not need it, you may remove this non-word boundary)
# - a # char
\w+ - 1 or more word chars (letters, digits or _)
(?:\s*#\w+)* - zero or more occurrences of:
\s* - zero or more whitespaces
# - a # char
\w+ - 1+ word chars
\s* - zero or more whitespaces
$ - end of string.
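The edit in the question also asks about hashtags at the beginning of the text. The answer above does not cover that case, so the following is only a sketch under the assumption that the same idea can be mirrored with a ^ anchor; the pattern and the names no_lead and both are not from the original answer.

```r
x <- "#Thomas20 I didn't know it could be #boring. guess I need some fun #movie #lateNightThoughts"

# Leading hashtags: one or more "#word" tokens (with trailing spaces) at the start
no_lead <- sub("^\\s*(?:#\\w+\\s*)+", "", x, perl = TRUE)
print(no_lead)
## => [1] "I didn't know it could be #boring. guess I need some fun #movie #lateNightThoughts"

# Then strip the trailing run of hashtags with the pattern from the answer
both <- sub("(?:\\s*#\\w+)+\\s*$", "", no_lead, perl = TRUE)
print(both)
## => [1] "I didn't know it could be #boring. guess I need some fun"
```

The hashtag in the middle of the sentence (#boring) is left alone in both steps because neither anchored pattern can reach it.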

matching start of a string but not end in R

How can I match all words starting with plan_ and not ending with template without using invert = TRUE? In the below example, I'd like to match only the second string. I tried with negative lookahead but it does not work, maybe because of greediness?
names <- c("plan_x_template", "plan_x")
grep("^plan.*(?!template)$",
names,
value = TRUE, perl = TRUE
)
#> [1] "plan_x_template" "plan_x"
I mean one can also solve the problem with two regex calls but I'd like to see how it works the other way :-)
is_plan <- grepl("^plan_", names)
is_template <- grepl("_template$", names)
names[is_plan & !is_template]
#> [1] "plan_x"
You may use
names <- c("plan_x_template", "plan_x")
grep("^plan(?!.*template)",
names,
value = TRUE, perl = TRUE
)
See the R online demo
The ^plan(?!.*template) pattern matches:
^ - a start of string
plan - a plan substring
(?!.*template) - a negative lookahead that fails the match if, immediately to the right of the current location, there are 0+ chars other than line break chars (since perl = TRUE is used and the pattern is processed with a PCRE engine, the . does not match all possible chars as opposed to the default grep TRE regex engine), as many as possible, followed with template substring.
NOTE: In case of multiline strings, you need to use a DOTALL modifier in the regex, "(?s)^plan(?!.*template)".
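The attempt in the question fails because .* first consumes everything up to the end of the string, and (?!template) is then tested at a position where nothing follows, so it always succeeds. The sketch below compares the pattern from the answer with a stricter variant that only rejects template at the very end; that variant, the extra test strings, and the variable names are assumptions added for illustration.

```r
plan_names <- c("plan_x_template", "plan_x", "plan_template_v2")

# Pattern from the answer: rejects "template" anywhere after "plan"
anywhere <- grep("^plan(?!.*template)", plan_names, value = TRUE, perl = TRUE)
print(anywhere)
## => [1] "plan_x"

# Variant: reject only names that END with "template"
at_end <- grep("^plan_(?!.*template$)", plan_names, value = TRUE, perl = TRUE)
print(at_end)
## => [1] "plan_x"           "plan_template_v2"

# Multiline input: without (?s) the dot stops at the line break
multi <- "plan_a\ntemplate"
no_dotall <- grepl("^plan(?!.*template)", multi, perl = TRUE)
with_dotall <- grepl("(?s)^plan(?!.*template)", multi, perl = TRUE)
print(c(no_dotall, with_dotall))
## => [1]  TRUE FALSE
```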

Extract digits after matching the certain string second time

I want to extract the digits after the second occurrence of the underscore _ in a pattern.
by following the similar posts here
Matching different digits after a lookahead
regex - return all before the second occurrence
I tried
library(stringr)
pattern <- c("1/2/3_500k/855kk_1400k/AVBB")
str_extract(pattern, "(^_){2}(\\d+\\.*\\d*)")
which outputs
[1] NA
instead of 1400. Could you help?
You may use a base R solution with regexpr/regmatches:
regmatches(x, regexpr("^(?:[^_]*_){2}[^_0-9]*\\K\\d*\\.?\\d+", x, perl=TRUE))
Or, with sub:
sub("^(?:[^_]*_){2}[^_0-9]*(\\d*\\.?\\d+).*", "\\1", x)
See the R demo online.
The regex is
^(?:[^_]*_){2}[^_0-9]*\K\d*\.?\d+
See the online regex demo.
Details
^ - start of string
(?:[^_]*_){2} - 2 repetitions of
[^_]* - any 0+ chars other than _
_ - an underscore
[^_0-9]* - any 0+ chars other than _ and digits
\K - match reset operator discarding all text matched so far
\d*\.?\d+ - a float or integer number pattern (0+ digits, an optional . and then 1+ digits).
In the sub regex variation, the \K is not necessary, the number pattern is captured into a capturing group and the rest of string is matched with .* pattern. The result is the contents of Group 1, referred to with the \1 placeholder.
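Both variants can be run side by side on the string from the question. This is a minimal sketch that assumes x holds that string (the answer leaves x undefined); the names via_k, via_sub and value are illustrative.

```r
x <- "1/2/3_500k/855kk_1400k/AVBB"

# \K variant: everything matched before \K is dropped from the result
via_k <- regmatches(x, regexpr("^(?:[^_]*_){2}[^_0-9]*\\K\\d*\\.?\\d+", x, perl = TRUE))
print(via_k)
## => [1] "1400"

# sub variant: the number is captured and the whole string is replaced by it
via_sub <- sub("^(?:[^_]*_){2}[^_0-9]*(\\d*\\.?\\d+).*", "\\1", x)
print(via_sub)
## => [1] "1400"

# Both results are character strings; convert if a number is needed
value <- as.numeric(via_k)
print(value)
## => [1] 1400
```

Note that sub() returns the input unchanged when the pattern does not match, while regmatches() returns an empty character vector, so the two variants behave differently on strings with fewer than two underscores.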
One option could be as:
pattern <- c("1/2/3_500k/855kk_1400k/AVBB")
sub(".*_.*_(\\d+).*","\\1", pattern, perl = TRUE)
[1] "1400"
The regex is:
".*_.*_(\\d+).*"
Details:
.*_ anything before first _
.*_ anything after first _ and before 2nd _
\\d+ look for digits and take those as selection.
.* anything afterwards.
\\1 replaces matching strings with values found for 1st group.
