how do you extract values between two characters in R?

I am trying to extract the server name (server101) from this string in R using a regular expression, i.e. the value between # and the first period (.) that follows it:
t<-c("Current CPU load - jvm machine[example network-app_svc_group_mem4]#server101.example.com")
I've tried this:
gsub('.*\\#(\\d+),(\\d+).*', '\\1', t)
This does not seem to be working. Any ideas?

Since you only expect one match, you may use a simple sub here:
t <- "Current CPU load - jvm machine[example network-app_svc_group_mem4]#server101.example.com"
sub(".*#([^.]+)\\..*", "\\1", t)
## => [1] "server101"
See the R demo online.
Details
.* - any 0+ chars, as many as possible
# - a # char
([^.]+) - Group 1 ("\\1"): one or more chars other than a dot
\\. - a dot (other chars you need to escape are $, ^, *, (, ), +, [, \, ?)
.* - any 0+ chars, as many as possible
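One caveat with the sub approach is worth a quick sketch: when the pattern does not match, sub returns the input unchanged rather than NA. The input vector below is hypothetical (it is not from the question), and masking with grepl is just one possible way to handle it:

```r
# Hypothetical inputs: the second element has no "#", so the pattern cannot match it.
inputs <- c("jvm machine[grp]#server101.example.com", "no marker here")
pattern <- ".*#([^.]+)\\..*"

extracted <- sub(pattern, "\\1", inputs)
# sub() leaves non-matching elements untouched, so mark them as missing explicitly.
extracted[!grepl(pattern, inputs)] <- NA_character_
extracted
```

Without the grepl step the second element would come back as "no marker here", which is easy to mistake for a real server name.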
Here are some alternatives.
You may use the following base R code to extract 1+ characters other than . ([^.]+) after the first #:
> t <- "Current CPU load - jvm machine[example network-app_svc_group_mem4]#server101.example.com"
> pattern="#([^.]+)"
> m <- regmatches(t,regexec(pattern,t))
> result = unlist(m)[2]
> result
[1] "server101"
With regexec, you can access submatches (capturing group contents).
See the online R demo
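If you have a whole vector of strings rather than one, the regexec result is a list with one element per input, so unlist(m)[2] no longer lines up. A minimal sketch, assuming a hypothetical vector x, that pulls Group 1 from each element:

```r
# Hypothetical inputs; the last one contains no "#".
x <- c("a#server101.example.com", "b#db02.example.org", "no marker")

# regexec() returns the full match plus each capture group for every element.
m <- regmatches(x, regexec("#([^.]+)", x))

# Element 2 of each list entry is Group 1; a non-match yields character(0), hence NA.
servers <- vapply(m, function(g) g[2], character(1))
servers
```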
Another way is to use regmatches/regexpr with a PCRE regex with a (?<=#) lookbehind that only checks for the character presence, but does not put the character into the match:
> result2 <- regmatches(t, regexpr("(?<=#)[^.]+", t, perl=TRUE))
> result2
[1] "server101"
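Note that regmatches combined with regexpr silently drops elements that do not match, so the result can be shorter than the input. A small sketch with a hypothetical vector x showing one way to keep the result aligned with the input:

```r
# Hypothetical inputs; the middle one has no "#".
x <- c("a#server101.example.com", "no marker", "b#db02.example.org")

m <- regexpr("(?<=#)[^.]+", x, perl = TRUE)
hits <- regmatches(x, m)              # only the matching elements are returned

# Write the hits back into a vector of the original length.
aligned <- rep(NA_character_, length(x))
aligned[as.integer(m) != -1L] <- hits
aligned
```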
A clean stringr approach would be to use the same pattern with str_extract (stringr uses the ICU regex flavor, which also supports lookarounds):
> library(stringr)
> t<-c("Current CPU load - jvm machine[example network-app_svc_group_mem4]#server101.example.com")
> str_extract(t, "(?<=#)[^.]+")
[1] "server101"

with stringr:
library(stringr)
str_match(t, ".*#([^\\.]*)\\..*")[2]
#[1] "server101"

Related

Retain string up to second slash in regex?

I am trying to only retain the string after the first section of characters (which includes - and numerics) but before the forward slash.
I have the following string:
x <- c('/youtube.com/videos/cats', '/google.com/images/dogs', 'bbc.com/movies')
/youtube.com/videos/cats
/google.com/images/dogs
bbc.com/movies
So it would look like this
/youtube.com/
/google.com/
bbc.com/
For reference I am using R 3.6
I have tried positive lookbehinds and the closest I got was this: ^\/[^\/]*
Any help appreciated
So in the bbc.com/movies example - the string does not start with a forward slash / but I still want to be able to keep the bbc.com part during the match

You can use a sub here to only perform a single regex replacement:
sub('^(/?[^/]*/).*', '\\1', x)
See the regex demo.
Details
^ - start of string
(/?[^/]*/) - Capturing group 1 (\1 in the replacement pattern): an optional /, then 0 or more chars other than / and then a /
.* - any zero or more chars, as many as possible.
See an R test online:
test <- c("/youtube.com/videos/cats", "/google.com/images/dogs", "bbc.com/movies")
sub('^(/?[^/]*/).*', '\\1', test)
# => [1] "/youtube.com/" "/google.com/" "bbc.com/"
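The same prefix can also be extracted instead of replaced. This is a sketch of that variant with regmatches/regexpr, not part of the answer above; it reuses the body of the capturing group as the whole pattern:

```r
test <- c("/youtube.com/videos/cats", "/google.com/images/dogs", "bbc.com/movies")

# Optional leading slash, then everything up to and including the next slash.
prefix <- regmatches(test, regexpr("^/?[^/]*/", test))
prefix
```

Keep in mind that strings without any slash would be dropped from the result here, whereas sub would return them unchanged.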

First, great username. Try this: you can leverage the fact that str_extract only pulls the first match out. Assuming all URLs look like word.word, this pattern should work. Note that \\w covers letters, digits and underscores but not hyphens, so let me know if any of your URLs contain hyphens.
library(stringr)
c("/youtube.com/videos/cats",
"/google.com/images/dogs",
"bbc.com/movies") %>%
str_extract(., "/?\\w+\\.\\w+/")
produces
"/youtube.com/" "/google.com/" "bbc.com/"
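Since the question mentions that the first section can include - and numerics, here is a hedged base R sketch that allows hyphens and more than one dot in the host part. The URLs below are made up for illustration:

```r
# Hypothetical URLs with a hyphen, a digit and a two-part suffix.
urls <- c("/my-site1.co.uk/a/b", "bbc.com/movies")

# Optional slash, a label of letters/digits/hyphens, one or more ".label" parts, then a slash.
host <- regmatches(urls, regexpr("/?[[:alnum:]-]+(\\.[[:alnum:]-]+)+/", urls))
host
```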

Using base R
gsub('(\\/?.*\\.com\\/).*', '\\1', x)
[1] "/youtube.com/" "/google.com/" "bbc.com/"

An alternative would be to use the rebus package:
library(rebus)
library(stringi)
t <- c("/youtube.com/videos/cats"," /google.com/images/dogs"," bbc.com/movie")
pattern <- zero_or_more("/") %R% one_or_more(ALPHA) %R% DOT %R% one_or_more(ALPHA) %R% zero_or_more("/")
stringi::stri_extract_first_regex(t, pattern)
[1] "/youtube.com/" "/google.com/" "bbc.com/"

R Regex capture group?

I have a lot of strings like this:
2019/01/01/07/556662_cba3a4fc-cb8f-4150-859f-5f21a38373d0
I want to extract the substring that lies right after the last "/" and ends with "_":
556662
I have found out how to extract: /01/01/07/556662
by using the following regex: (\/)(.*?)(?=\_)
Please advise how I can capture the right group.

You may use
x <- "2019/01/01/07/556662_cba3a4fc-cb8f-4150-859f-5f21a38373d0"
regmatches(x, regexpr(".*/\\K[^_]+", x, perl=TRUE))
## [1] "556662"
See the regex and R demo.
Here, the regex matches and outputs the first substring that matches:
.*/ - any 0+ chars, as many as possible, up to the last /
\K - match reset operator: drops the text matched so far from the match value
[^_]+ - puts 1 or more chars other than _ into the match value.
Or, a sub solution:
sub(".*/([^_]+).*", "\\1", x)
See the regex demo.
Here, it is similar to the previous one, but the 1 or more chars other than _ are captured into Group 1 (\1 in the replacement pattern) and the trailing .* makes sure the whole input is matched (and consumed, ready to be replaced).
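Because the input looks like a path, another option (a sketch, not one of the solutions above) is to let basename take the part after the last / and then strip everything from the first underscore on:

```r
x <- "2019/01/01/07/556662_cba3a4fc-cb8f-4150-859f-5f21a38373d0"

# basename() keeps the text after the last "/"; sub() then removes "_" and the rest.
id <- sub("_.*", "", basename(x))
id
```

This assumes the part you want never contains an underscore itself.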
Alternative non-base R solutions
If you can afford or prefer to work with stringi, you may use
library(stringi)
stri_match_last_regex("2019/01/01/07/556662_cba3a4fc-cb8f-4150-859f-5f21a38373d0", ".*/([^_]+)")[,2]
## [1] "556662"
This will match a string up to the last / and will capture into Group 1 (that you access in Column 2 using [,2]) 1 or more chars other than _.
Or
stri_extract_last_regex("2019/01/01/07/556662_cba3a4fc-cb8f-4150-859f-5f21a38373d0", "(?<=/)[^_/]+")
## => [1] "556662"
This will extract the last match of a string that consists of 1 or more chars other than _ and / after a /.

You could use a capturing group:
/([^_/]+)_[^/\s]*
Explanation
/ Match literally
([^_/]+) Capture in a group matching not an underscore or forward slash
_[^/\s]* Match _ and then 0+ times not a forward slash or a whitespace character
Regex demo | R demo
One option to get the capturing group might be to get the second column using str_match:
library(stringr)
str = c("2019/01/01/07/556662_cba3a4fc-cb8f-4150-859f-5f21a38373d0")
str_match(str, "/([^_/]+)_[^/\\s]*")[,2]
# [1] "556662"
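For those who would rather avoid stringr, the same capturing group can probably be read with base R regmatches/regexec. In this sketch the \s from the pattern is swapped for the POSIX class [:space:] inside the bracket expression, since that form is the safer one with the default regex engine:

```r
x <- "2019/01/01/07/556662_cba3a4fc-cb8f-4150-859f-5f21a38373d0"

# Element 1 of the result is the full match, element 2 is Group 1.
m <- regmatches(x, regexec("/([^_/]+)_[^/[:space:]]*", x))
group1 <- m[[1]][2]
group1
```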

I changed the Regex rules according to the code of Wiktor Stribiżew.
x <- "2019/01/01/07/556662_cba3a4fc-cb8f-4150-859f-5f21a38373d0"
regmatches(x, regexpr(".*/([0-9]+)", x, perl=TRUE))
sub(".*/([0-9]+).*", "\\1", x)
Output
[1] "2019/01/01/07/556662"
[1] "556662"
R demo

extracting part of path using sub

I'm attempting to extract a filename from a path in r. In a string like
someurl.com/vp/125514_45147_55144.jpg?_nc25244
I want to extract 125514_45147_55144
I'm using the following expression:
sub(".*vp/(.*?)/.*", "\\1", input)
which works but it also strips the underscores:
1255144514755144
I cannot figure out how to retain the underscores

Remove dot and everything after it of the basename:
sub("\\..*", "", basename(x))
## [1] "125514_45147_55144"
If it is possible that there are dots in the filename then use this slightly more complex pattern:
sub("(.*)\\..*", "\\1", basename(x))
## [1] "125514_45147_55144"

I suggest fixing it as
sub(".*/vp/([^/?]*?)\\.[^/?.]*(?:\\?.*)?$", "\\1", input)
See the regex demo
Details
.* - any 0+ chars as many as possible
/vp/ - a literal substring
([^/?]*?) - Group 1 (its captured value is referenced by \1 from the replacement pattern): any 0+ chars other than / and ?, as few as possible
\\. - a dot
[^/?.]* - 0+ chars other than ., ? and /
(?:\\?.*)? - an optional substring matching ? and then any 0+ chars as many as possible
$ - end of string.
With regmatches/regexec the pattern becomes much clearer:
x <- "someurl.com/vp/125514_45147_55144.jpg?_nc25244"
regmatches(x,regexec("/vp/([^/?]*)\\.",x))[[1]][2]
## => [1] "125514_45147_55144"
See the R demo
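A sketch of a mostly regex-free variant, assuming the query string always starts at the first ?: strip the query, then use basename and tools::file_path_sans_ext (from the tools package that ships with R) to drop the directory and the extension:

```r
x <- "someurl.com/vp/125514_45147_55144.jpg?_nc25244"

# Remove "?..." first, because file_path_sans_ext() only strips alphanumeric extensions.
no_query <- sub("\\?.*$", "", x)
stem <- tools::file_path_sans_ext(basename(no_query))
stem
```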

stringr alternative
library( stringr )
str_match( "someurl.com/vp/125514_45147_55144.jpg?_nc25244", "^.*/(.*?)\\..*$" )[[2]]
#[1] "125514_45147_55144"

Inspired by the answer of @G.Grothendieck, a regex-free solution using dirname, basename and chartr:
x = 'someurl.com/vp/125514_45147_55144.jpg?_nc25244'
dirname(chartr(x = basename(x), ".", "/"))
# [1] "125514_45147_55144"
Assuming there is no dot in the filename.

Can't figure out why regex group is not working in str_match

I have the following code with a regex
CHARACTER <- "^([A-Z0-9 .])+(?:\\(V\\.O\\.\\))?(?:\\(O\\.S\\.\\))?(?:\\(CONT'D\\))?$"
str_match("WILL (V.O.)",CHARACTER)[1,2]
I thought this should match the value of "WILL " but it is returning blank.
I tried the RegEx in a different language and the group is coming back blank in that instance also.
What do I have to add to this regex to pull back just the value "WILL"?

You formed a repeated capturing group by placing + outside the group, so the group only keeps its last iteration (here, the space after WILL). Put the + back inside the group:
CHARACTER <- "^([A-Z0-9 .]+)(?:\\(V\\.O\\.\\))?(?:\\(O\\.S\\.\\))?(?:\\(CONT'D\\))?$"
Note that you can also drop the trailing space after WILL if you use a lazy match with \s* after the group:
CHARACTER <- "^([A-Z0-9\\s.]+?)\\s*(?:\\(V\\.O\\.\\))?(?:\\(O\\.S\\.\\))?(?:\\(CONT'D\\))?$"
See this regex demo.
> library(stringr)
> CHARACTER <- "^([A-Z0-9\\s.]+?)\\s*(?:\\(V\\.O\\.\\))?(?:\\(O\\.S\\.\\))?(?:\\(CONT'D\\))?$"
> str_match("WILL (V.O.)",CHARACTER)[1,2]
[1] "WILL"
Alternatively, you may just extract WILL with
> str_extract("WILL (V.O.)", "^.*?(?=\\s*(?:\\(V\\.O\\.\\))?(?:\\(O\\.S\\.\\))?(?:\\(CONT'D\\))?$)")
[1] "WILL"
Or the same with base R (with x holding the input string):
> x <- "WILL (V.O.)"
> regmatches(x, regexpr("^.*?(?=\\s*(?:\\(V\\.O\\.\\))?(?:\\(O\\.S\\.\\))?(?:\\(CONT'D\\))?$)", x, perl=TRUE))
[1] "WILL"
Here,
^ - matches the start of a string
.*? - any 0+ chars other than line break chars as few as possible
(?=\\s*(?:\\(V\\.O\\.\\))?(?:\\(O\\.S\\.\\))?(?:\\(CONT'D\\))?$) - makes sure that, immediately to the right of the current location, there is
\\s* - 0+ whitespaces
(?:\\(V\\.O\\.\\))? - an optional (V.O.) substring
(?:\\(O\\.S\\.\\))? - an optional (O.S.) substring
(?:\\(CONT'D\\))? - an optional (CONT'D) substring
$ - end of string.
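To see the repeated capturing group effect in isolation, here is a minimal sketch on a simplified pattern (just [A-Z], not the full CHARACTER regex). It uses base R regexec with perl = TRUE, so the exact values are what I would expect from PCRE:

```r
x <- "WILL"

# "+" outside the group: the group is re-captured on every repetition,
# so only the last letter survives in Group 1.
outside <- regmatches(x, regexec("^([A-Z])+", x, perl = TRUE))[[1]]

# "+" inside the group: the group spans all repetitions.
inside <- regmatches(x, regexec("^([A-Z]+)", x, perl = TRUE))[[1]]

outside  # full match, then Group 1
inside
```

In both cases the full match is the same; only the content of Group 1 differs.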

extract text between certain characters in R

I need to capture TEST_WF1_CORP[-application-com.ibm.ws.runtime.WsServer] from the following string, basically from - to # sign.
i<-c("Current CPU load - TEST_WF1_CORP[-application-com.ibm.ws.runtime.WsServer]#example1.com")
I've tried this:
str_match(i, ".*-([^\\.]*)\\#.*")[,2]
I am getting NA, any ideas?

1) gsub Replace everything up to and including -, i.e. .* -, and everything after and including #, i.e. #.*, with a zero length string. No packages are needed:
gsub(".* - |#.*", "", i)
## [1] "TEST_WF1_CORP[-application-com.ibm.ws.runtime.WsServer]"
2) sub This would also work. It matches everything up to minus, space (i.e. .*- ), then captures everything until # (i.e. (.*)# ), followed by whatever is left (.*), and replaces the whole string with the capture group, i.e. the part within parens. It also uses no packages.
sub(".*- (.*)#.*", "\\1", i)
## [1] "TEST_WF1_CORP[-application-com.ibm.ws.runtime.WsServer]"
Note: We used this as input i:
i <- "Current CPU load - TEST_WF1_CORP[-application-com.ibm.ws.runtime.WsServer]#example1.com"
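The space after the hyphen in these patterns matters, because the part you want contains hyphens too. A short sketch of what seems to go wrong otherwise, including why the pattern from the question finds nothing (every hyphen is followed by a dot somewhere before the #):

```r
i <- "Current CPU load - TEST_WF1_CORP[-application-com.ibm.ws.runtime.WsServer]#example1.com"

# Without the space, the greedy .* runs up to the LAST hyphen before "#".
too_short <- sub(".*-(.*)#.*", "\\1", i)

# Requiring "- " pins the match to the only hyphen that is followed by a space.
wanted <- sub(".*- (.*)#.*", "\\1", i)

# The question's pattern forbids dots between the hyphen and "#", so it never matches.
matches_original <- grepl(".*-([^\\.]*)\\#.*", i)

too_short
wanted
matches_original
```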

The following should work:
extract <- unlist(strsplit(i,"- |#"))[2]

You may use
-\s*([^#]+)
See the regex demo
Details:
- - a hyphen
\s* - zero or more whitespaces
([^#]+) - Group 1 capturing 1 or more chars other than #.
R demo:
> library(stringr)
> i<-c("Current CPU load - TEST_WF1_CORP[-application-com.ibm.ws.runtime.WsServer]#example1.com")
> str_match(i, "-\\s*([^#]+)")[,2]
[1] "TEST_WF1_CORP[-application-com.ibm.ws.runtime.WsServer]"
The same pattern can be used with base R regmatches/regexec:
> regmatches(i, regexec("-\\s*([^#]+)", i))[[1]][2]
[1] "TEST_WF1_CORP[-application-com.ibm.ws.runtime.WsServer]"
If you prefer a replacing approach you may use a sub:
> sub(".*?-\\s*([^#]+).*", "\\1", i)
[1] "TEST_WF1_CORP[-application-com.ibm.ws.runtime.WsServer]"
Here, .*? matches any 0+ chars, as few as possible, up to the first -, then -, 0+ whitespaces (\\s*), then 1+ chars other than # are captured into Group 1 (see ([^#]+)) and then .* matches the rest of the string. The \1 in the replacement pattern puts the contents of Group 1 back into the replacement result.
