R Regex capture group?

I have a lot of strings like this:
2019/01/01/07/556662_cba3a4fc-cb8f-4150-859f-5f21a38373d0
I want to extract the substring that lies right after the last "/" and ends with "_":
556662
I have found out how to extract: /01/01/07/556662
by using the following regex: (\/)(.*?)(?=\_)
Please advise how I can capture the right group.

You may use
x <- "2019/01/01/07/556662_cba3a4fc-cb8f-4150-859f-5f21a38373d0"
regmatches(x, regexpr(".*/\\K[^_]+", x, perl=TRUE))
## [1] "556662"
Here, the regex matches and outputs the first substring that matches
.*/ - any 0+ chars as many as possible up to the last /
\K - omits this part from the match
[^_]+ - puts 1 or more chars other than _ into the match value.
Or, a sub solution:
sub(".*/([^_]+).*", "\\1", x)
Here, it is similar to the previous one, but the 1 or more chars other than _ are captured into Group 1 (\1 in the replacement pattern) and the trailing .* makes sure the whole input is matched (and consumed, ready to be replaced).
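The two base R approaches above differ when some inputs do not match: regmatches() with regexpr() drops non-matching elements, while sub() returns them unchanged. A minimal sketch, assuming a hypothetical vector paths whose second element contains no "/" (the variable names are illustrative, not from the original answer):

```r
# Hypothetical input: the second element has no "/" and therefore cannot match
paths <- c("2019/01/01/07/556662_cba3a4fc", "no-slash-here")

# regmatches() keeps only the elements that matched, so the result is shorter
ids_regmatches <- regmatches(paths, regexpr(".*/\\K[^_]+", paths, perl = TRUE))

# sub() leaves a non-matching element untouched, so the length is preserved
ids_sub <- sub(".*/([^_]+).*", "\\1", paths)

# Mark non-matches explicitly as NA when positions must line up with the input
ids_na <- ifelse(grepl(".*/[^_]+", paths), ids_sub, NA_character_)
```

If the results must stay aligned with the input (for example, as a new data frame column), the sub() form combined with a grepl() check is probably the safer choice.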
Alternative non-base R solutions
If you can afford or prefer to work with stringi, you may use
library(stringi)
stri_match_last_regex("2019/01/01/07/556662_cba3a4fc-cb8f-4150-859f-5f21a38373d0", ".*/([^_]+)")[,2]
## [1] "556662"
This will match a string up to the last / and will capture into Group 1 (that you access in Column 2 using [,2]) 1 or more chars other than _.
Or
stri_extract_last_regex("2019/01/01/07/556662_cba3a4fc-cb8f-4150-859f-5f21a38373d0", "(?<=/)[^_/]+")
## => [1] "556662"
This will extract the last match of a string that consists of 1 or more chars other than _ and / after a /.

You could use a capturing group:
/([^_/]+)_[^/\s]*
Explanation
/ Match literally
([^_/]+) Capture in a group matching not an underscore or forward slash
_[^/\s]* Match _ and then 0+ times not a forward slash or a whitespace character
One option to get the capturing group might be to get the second column using str_match:
library(stringr)
str = c("2019/01/01/07/556662_cba3a4fc-cb8f-4150-859f-5f21a38373d0")
str_match(str, "/([^_/]+)_[^/\\s]*")[,2]
# [1] "556662"
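If stringr is not available, the same capturing group can be read with base R. This is only a sketch of one possible equivalent; perl = TRUE is passed here so that \s inside the bracket expression is interpreted as whitespace:

```r
x <- "2019/01/01/07/556662_cba3a4fc-cb8f-4150-859f-5f21a38373d0"

# regexec() records the whole match plus every capture group
m <- regmatches(x, regexec("/([^_/]+)_[^/\\s]*", x, perl = TRUE))

# Element 1 is the whole match, element 2 is Group 1
id <- m[[1]][2]
id
# [1] "556662"
```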

I changed the Regex rules according to the code of Wiktor Stribiżew.
x <- "2019/01/01/07/556662_cba3a4fc-cb8f-4150-859f-5f21a38373d0"
regmatches(x, regexpr(".*/([0-9]+)", x, perl=TRUE))
sub(".*/([0-9]+).*", "\\1", x)
Output
[1] "2019/01/01/07/556662"
[1] "556662"

Related

How to replace `qux$foo$bar` with `qux[["foo"]][["bar"]]`? (dollar subsetting to brackets subsetting)

In an R script (that I would read with readLines), I want to replace every occurrence of qux$foo$bar with qux[["foo"]][["bar"]]. But I'm not a regex master.
I started with this regex:
> gsub("(\\w*)(\\$)(\\w*)", '\\1[["\\3"]]', "qux$foo$bar; input$test$a$a") %>% cat
qux[["foo"]][["bar"]]; input[["test"]][["a"]][["a"]]
Nice. But I also want to handle the case of backticks. So I tried:
> gsub("(\\w*)(\\$)`{0,1}(\\w*)`{0,1}", '\\1[["\\3"]]', "qux$`foo`; bar$`baz`; x$uvw") %>% cat
qux[["foo"]]; bar[["baz"]]; x[["uvw"]]
Looks correct. But between the backticks, there could be a space, and the previous way does not work in this case. So I tried the following, which does not work either:
gsub("(\\w*)(\\$)`{0,1}(.*)`{0,1}", '\\1[["\\3"]]', "qux$`fo o`") %>% cat
qux[["fo o`"]]
Could you help to find the right regex pattern? It seems that instead of \\w I need something which means match a "word that can contain spaces".
You can use
gsub('(\\w*)(?|\\$`([^`]*)`|\\$([^\\s$]+))', '\\1[["\\2"]]', x, perl=TRUE)
## Or
gsub('\\$`([^`]*)`|\\$([^\\s$]+)', '[["\\1\\2"]]', x, perl=TRUE)
Details:
(\w*) - Group 1 (\1): zero or more word chars
(?|\$`([^`]*)`|\$([^\s$]+)) - a branch reset group matching either
\$`([^`]*)` - \$, a backtick, Group 2 (\2) capturing zero or more non-backtick chars, and a backtick.
| - or
\$([^\s$]+) - \$, then Group 2 (\2) capturing one or more chars other than whitespace and $
See the R demo:
x <- c('qux$foo$bar','qux$foo$bar; input$test$a$a','qux$`foo`; bar$`baz`; x$uvw','qux$`fo o`', 'q_ux$f_o_o$b.a_r')
gsub('(\\w*)(?|\\$`([^`]*)`|\\$([^\\s$]+))', '\\1[["\\2"]]', x, perl=TRUE)
## Or
## gsub('\\$`([^`]*)`|\\$([^\\s$]+)', '[["\\1\\2"]]', x, perl=TRUE)
Output:
[1] "qux[[\"foo\"]][[\"bar\"]]"
[2] "qux[[\"foo\"]][[\"bar;\"]] input[[\"test\"]][[\"a\"]][[\"a\"]]"
[3] "qux[[\"foo\"]]; bar[[\"baz\"]]; x[[\"uvw\"]]"
[4] "qux[[\"fo o\"]]"
[5] "q_ux[[\"f_o_o\"]][[\"b.a_r\"]]"
Note: backslashes in the output are console artifacts to keep the double quoted strings valid string literals, they are not part of the plain text output.
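To check that the backslashes are only a printing artifact, the result can be written with writeLines() instead of being auto-printed. A small sketch using the second pattern from above on a made-up input string:

```r
# Made-up input combining a backticked name and a plain name
x <- 'qux$`fo o`$bar'
res <- gsub('\\$`([^`]*)`|\\$([^\\s$]+)', '[["\\1\\2"]]', x, perl = TRUE)

print(res)       # console representation, with escaped quotes
# [1] "qux[[\"fo o\"]][[\"bar\"]]"
writeLines(res)  # the actual characters in the string
# qux[["fo o"]][["bar"]]
```

Only one of the two groups participates in each match, and the other one is substituted as an empty string, which is why "\\1\\2" works as the replacement.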
You might repeat optional spaces before and after matching 1 or more word characters.
You don't need a capture group for the $ but instead you could use a capture group to pair up the backtick in case it is there or not using a backreference to group 2.
To repeat 0+ whitespace chars you can also use \s but that could also match a newline.
Note that \w* matches optional word chars, and {0,1} can be written as ?
(\w*)\$(`?)( *\w+(?: +\w+)* *)\2
The pattern matches:
(\w*) Capture group 1 Match optional word characters
\$ Match $
(`?) Capture group 2, optionally match a backtick
( *\w+(?: +\w+)* *) Capture group 3 Match repetitions of word characters between spaces
\2 Backreference to what is captured in group 2 (yes or no backtick)
gsub("(\\w*)\\$(`?)( *\\w+(?: +\\w+)* *)\\2", '\\1[["\\3"]]', "qux$fo o$bar", perl=TRUE)
Output
[1] "qux[[\"fo o\"]][[\"bar\"]]"
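The effect of the \2 backreference can be seen by comparing a balanced and an unbalanced backtick. In this sketch, to_brackets is a hypothetical helper name introduced only for illustration:

```r
pattern <- "(\\w*)\\$(`?)( *\\w+(?: +\\w+)* *)\\2"

# Hypothetical helper wrapping the gsub() call shown above
to_brackets <- function(code) gsub(pattern, '\\1[["\\3"]]', code, perl = TRUE)

balanced   <- to_brackets("qux$`fo o`")  # backtick opened and closed
unbalanced <- to_brackets("qux$`foo")    # opening backtick only

writeLines(balanced)
# qux[["fo o"]]
writeLines(unbalanced)  # no match, so the input is returned unchanged
# qux$`foo
```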

Can't figure out why regex group is not working in str_match

I have the following code with a regex
CHARACTER <- "^([A-Z0-9 .])+(?:\\(V\\.O\\.\\))?(?:\\(O\\.S\\.\\))?(?:\\(CONT'D\\))?$"
str_match("WILL (V.O.)",CHARACTER)[1,2]
I thought this should match the value of "WILL " but it is returning blank.
I tried the RegEx in a different language and the group is coming back blank in that instance also.
What do I have to add to this regex to pull back just the value "WILL"?
You formed a repeated capturing group by placing + outside the group, so the group keeps only the last character it matched (here, the trailing space). Put the + inside the group:
CHARACTER <- "^([A-Z0-9 .]+)(?:\\(V\\.O\\.\\))?(?:\\(O\\.S\\.\\))?(?:\\(CONT'D\\))?$"
Note you may trim WILL if you use a lazy match with \s* after the group:
CHARACTER <- "^([A-Z0-9\\s.]+?)\\s*(?:\\(V\\.O\\.\\))?(?:\\(O\\.S\\.\\))?(?:\\(CONT'D\\))?$"
> library(stringr)
> CHARACTER <- "^([A-Z0-9\\s.]+?)\\s*(?:\\(V\\.O\\.\\))?(?:\\(O\\.S\\.\\))?(?:\\(CONT'D\\))?$"
> str_match("WILL (V.O.)",CHARACTER)[1,2]
[1] "WILL"
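The difference between a repeated capturing group and a group that contains the quantifier can be reproduced in base R. A minimal sketch; suffix and group1 are names made up for this illustration:

```r
x <- "WILL (V.O.)"
suffix <- "(?:\\(V\\.O\\.\\))?(?:\\(O\\.S\\.\\))?(?:\\(CONT'D\\))?$"

# "+" outside the group: the group is overwritten on every repetition
repeated <- paste0("^([A-Z0-9 .])+", suffix)
# "+" inside the group: the group spans the whole run of characters
single <- paste0("^([A-Z0-9 .]+)", suffix)

# Hypothetical helper returning Group 1 of the first match
group1 <- function(p) regmatches(x, regexec(p, x, perl = TRUE))[[1]][2]

group1(repeated)  # " "     (only the last character matched by the group)
group1(single)    # "WILL " (trailing space included)
```

So the original pattern did not return an empty value: it returned the single space before "(V.O.)", which looks blank when printed.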
Alternatively, you may just extract WILL with
> str_extract("WILL (V.O.)", "^.*?(?=\\s*(?:\\(V\\.O\\.\\))?(?:\\(O\\.S\\.\\))?(?:\\(CONT'D\\))?$)")
[1] "WILL"
Or the same with base R:
> x <- "WILL (V.O.)"
> regmatches(x, regexpr("^.*?(?=\\s*(?:\\(V\\.O\\.\\))?(?:\\(O\\.S\\.\\))?(?:\\(CONT'D\\))?$)", x, perl=TRUE))
[1] "WILL"
Here,
^ - matches the start of a string
.*? - any 0+ chars other than line break chars as few as possible
(?=\\s*(?:\\(V\\.O\\.\\))?(?:\\(O\\.S\\.\\))?(?:\\(CONT'D\\))?$) - makes sure that, immediately to the right of the current location, there is
\\s* - 0+ whitespaces
(?:\\(V\\.O\\.\\))? - an optional (V.O.) substring
(?:\\(O\\.S\\.\\))? - an optional (O.S.) substring
(?:\\(CONT'D\\))? - an optional (CONT'D) substring
$ - end of string.
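As a quick check of the lookahead breakdown above, the same pattern can be run over several cue lines at once. The vector cues below is made-up sample data, not taken from the question:

```r
# Made-up sample cues: with (V.O.), with (CONT'D), and without any suffix
cues <- c("WILL (V.O.)", "SEAN (CONT'D)", "DR. SMITH")

pat <- "^.*?(?=\\s*(?:\\(V\\.O\\.\\))?(?:\\(O\\.S\\.\\))?(?:\\(CONT'D\\))?$)"

# Every element matches, so regmatches() returns one name per input
character_names <- regmatches(cues, regexpr(pat, cues, perl = TRUE))
character_names
# [1] "WILL"      "SEAN"      "DR. SMITH"
```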

Extract a year number from a string that is surrounded by special characters

What's a good way to extract only the number 2007 from the following string:
some_string <- "1_2_start_2007_3_end"
The pattern to detect the year number in my case would be:
4 digits
surrounded by "_"
I am quite new to using regular expressions. I tried the following:
regexp <- "_+[0-9]+_"
names <- str_extract(files, regexp)
But this does not take into account that there are always 4 digits and outputs the underlines as well.
You may use a sub option, too:
some_string <- "1_2_start_2007_3_end"
sub(".*_(\\d{4})_.*", "\\1", some_string)
Details
.* - any 0+ chars, as many as possible
_ - a _ char
(\\d{4}) - Group 1 (referred to via \1 from the replacement pattern): 4 digits
_.* - a _ and then any 0+ chars up to the end of string.
NOTE: akrun's str_extract(some_string, "(?<=_)\\d{4}") will extract the leftmost occurrence and my sub(".*_(\\d{4})_.*", "\\1", some_string) will extract the rightmost occurrence of a 4-digit substring enclosed with _. For my solution to return the leftmost one, use a lazy quantifier with the first .*: sub(".*?_(\\d{4})_.*", "\\1", some_string).
R test:
some_string <- "1_2018_start_2007_3_end"
sub(".*?_(\\d{4})_.*", "\\1", some_string) # leftmost
## -> 2018
sub(".*_(\\d{4})_.*", "\\1", some_string) # rightmost
## -> 2007
We can use regex lookbehind to specify the _ and extract the 4 digits that follow
library(stringr)
str_extract(some_string, "(?<=_)\\d{4}")
#[1] "2007"
If the _ must appear both before and after the 4 digits, then use a regex lookahead as well
str_extract(some_string, "(?<=_)\\d{4}(?=_)")
#[1] "2007"
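str_extract() returns only the first match. If a string may contain several 4-digit runs enclosed in underscores, the same lookarounds can be passed to base R's gregexpr() to collect all of them. A short sketch on a sample string with two candidates:

```r
# Sample string with two 4-digit numbers surrounded by "_"
some_string <- "1_2018_start_2007_3_end"

# gregexpr() finds every match; [[1]] selects the matches for the first (only) input
years <- regmatches(some_string,
                    gregexpr("(?<=_)\\d{4}(?=_)", some_string, perl = TRUE))[[1]]
years
# [1] "2018" "2007"
```

Because the lookbehind and lookahead do not consume the underscores, adjacent matches that share a single _ would still both be found.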
Just to get a non-regex approach out there, in which we split on _ and convert to numeric. All non-numbers will be coerced to NA, so we use !is.na to eliminate them. We then use nchar to count the characters, and pull the one with 4.
i1 <- as.numeric(strsplit(some_string, '_')[[1]])
i1 <- i1[!is.na(i1)]
i1[nchar(i1) == 4]
#[1] 2007
This is the quickest regex I could come up with:
\S.*_(\d{4})_\S.*
It means,
a non-space character followed by any characters (\S.*),
then _
followed by four digits (\d{4})
the four digits above are your year, captured using ()
another _
a non-space character followed by any remaining characters (\S.*)
Since you mentioned you're new, please test this and all other answers at https://regex101.com/. It is a good place to learn regex, and it explains in depth what your regex is actually doing.
If you just care about (year) then below regex is enough:
_(\d{4})_

r: regex for containing pattern with negation

Suppose I have the following strings and want to use grep to see which match:
business_metric_one
business_metric_one_dk
business_metric_one_none
business_metric_two
business_metric_two_dk
business_metric_two_none
And so on for various other metrics. I want to only match the first one of each group (business_metric_one and business_metric_two and so on). They are not in an ordered list so I can't index and have to use grep. At first I thought to do:
.*metric.*[^_dk|^_none]$
But this doesn't seem to work. Any ideas?
You need to use a PCRE pattern to filter the character vector:
x <- c("business_metric_one","business_metric_one_dk","business_metric_one_none","business_metric_two","business_metric_two_dk","business_metric_two_none")
grep("metric(?!.*_(?:dk|none))", x, value=TRUE, perl=TRUE)
## => [1] "business_metric_one" "business_metric_two"
The metric(?!.*_(?:dk|none)) pattern matches
metric - a metric substring
(?!.*_(?:dk|none)) - that is not followed with any 0+ chars other than line break chars followed with _ and then either dk or none.
NOTE: if you need to match only such values that contain metric and do not end with _dk or _none, use a variation, metric.*$(?<!_dk|_none) where the (?<!_dk|_none) negative lookbehind fails the match if the string ends with either _dk or _none.
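The two variants differ only when _dk or _none appears somewhere other than the end of the string. A sketch of the difference, assuming a hypothetical extra value "business_metric_dk_total" that is not in the original data:

```r
# The third value is hypothetical: it contains "_dk" but does not end with it
x <- c("business_metric_one", "business_metric_one_dk", "business_metric_dk_total")

# Lookahead: reject if _dk or _none occurs anywhere after "metric"
no_dk_anywhere <- grep("metric(?!.*_(?:dk|none))", x, value = TRUE, perl = TRUE)

# Lookbehind at the end: reject only if the string ends with _dk or _none
no_dk_at_end <- grep("metric.*$(?<!_dk|_none)", x, value = TRUE, perl = TRUE)

no_dk_anywhere
# [1] "business_metric_one"
no_dk_at_end
# [1] "business_metric_one"      "business_metric_dk_total"
```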
You can also do something like this:
grep("^([[:alpha:]]+_){2}[[:alpha:]]+$", string, value = TRUE)
# [1] "business_metric_one" "business_metric_two"
or use grepl to match dk and none, then negate the logical when you're indexing the original string:
string[!grepl("(dk|none)", string)]
# [1] "business_metric_one" "business_metric_two"
or, more specifically:
string[!grepl("business_metric_[[:alpha:]]+_(dk|none)", string)]
# [1] "business_metric_one" "business_metric_two"
Data:
string = c("business_metric_one","business_metric_one_dk","business_metric_one_none","business_metric_two","business_metric_two_dk","business_metric_two_none")

extract text between certain characters in R

I need to capture TEST_WF1_CORP[-application-com.ibm.ws.runtime.WsServer] from the following string, basically from - to # sign.
i<-c("Current CPU load - TEST_WF1_CORP[-application-com.ibm.ws.runtime.WsServer]#example1.com")
I've tried this:
str_match(i, ".*-([^\\.]*)\\#.*")[,2]
I am getting NA, any ideas?
1) gsub Replace everything up to and including -, i.e. .* -, and everything after and including #, i.e. #.*, with a zero length string. No packages are needed:
gsub(".* - |#.*", "", i)
## "TEST_WF1_CORP[-application-com.ibm.ws.runtime.WsServer]"
2) sub This would also work. It matches everything up to minus, space (i.e. .*- ) and then captures everything until # (i.e. (.*)# ) followed by whatever is left (.*) and replaces that with the capture group, i.e. the part within parens. It also uses no packages.
sub(".*- (.*)#.*", "\\1", i)
## [1] "TEST_WF1_CORP[-application-com.ibm.ws.runtime.WsServer]"
Note: We used this as input i:
i <- "Current CPU load - TEST_WF1_CORP[-application-com.ibm.ws.runtime.WsServer]#example1.com"
The following should work:
extract <- unlist(strsplit(i,"- |#"))[2]
You may use
-\s*([^#]+)
Details:
- - a hyphen
\s* - zero or more whitespaces
([^#]+) - Group 1 capturing 1 or more chars other than #.
R demo:
> library(stringr)
> i<-c("Current CPU load - TEST_WF1_CORP[-application-com.ibm.ws.runtime.WsServer]#example1.com")
> str_match(i, "-\\s*([^#]+)")[,2]
[1] "TEST_WF1_CORP[-application-com.ibm.ws.runtime.WsServer]"
The same pattern can be used with base R regmatches/regexec:
> regmatches(i, regexec("-\\s*([^#]+)", i))[[1]][2]
[1] "TEST_WF1_CORP[-application-com.ibm.ws.runtime.WsServer]"
If you prefer a replacing approach you may use a sub:
> sub(".*?-\\s*([^#]+).*", "\\1", i)
[1] "TEST_WF1_CORP[-application-com.ibm.ws.runtime.WsServer]"
Here, .*? matches any 0+ chars, as few as possible, up to the first -, then -, 0+ whitespaces (\\s*), then 1+ chars other than # are captured into Group 1 (see ([^#]+)) and then .* matches the rest of the string. The \1 in the replacement pattern puts the contents of Group 1 back into the replacement result.
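The lazy .*? matters here because the input contains several hyphens. The sketch below compares the lazy and greedy forms and also checks the pattern from the question, which fails because [^\.]* cannot cross the dots that sit between every hyphen and the #. The variable names are illustrative only:

```r
i <- "Current CPU load - TEST_WF1_CORP[-application-com.ibm.ws.runtime.WsServer]#example1.com"

# Lazy: stops at the first "-", so the whole name is captured
lazy <- sub(".*?-\\s*([^#]+).*", "\\1", i, perl = TRUE)

# Greedy: backtracks only as far as the last "-", so most of the name is lost
greedy <- sub(".*-\\s*([^#]+).*", "\\1", i, perl = TRUE)

# Pattern from the question: no "-" is followed by a dot-free run up to "#"
asker_matches <- grepl(".*-([^\\.]*)\\#.*", i, perl = TRUE)

lazy
# [1] "TEST_WF1_CORP[-application-com.ibm.ws.runtime.WsServer]"
greedy
# [1] "com.ibm.ws.runtime.WsServer]"
asker_matches
# [1] FALSE
```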

Resources