extract text between certain characters in R


I need to capture TEST_WF1_CORP[-application-com.ibm.ws.runtime.WsServer] from the following string, that is, everything between the hyphen and the # sign.
i<-c("Current CPU load - TEST_WF1_CORP[-application-com.ibm.ws.runtime.WsServer]#example1.com")
I've tried this:
str_match(i, ".*-([^\\.]*)\\#.*")[,2]
I am getting NA, any ideas?

1) gsub Replace everything up to and including space, hyphen, space (i.e. ".* - ") and everything from # onward (i.e. "#.*") with a zero-length string. No packages are needed:
gsub(".* - |#.*", "", i)
## [1] "TEST_WF1_CORP[-application-com.ibm.ws.runtime.WsServer]"
2) sub This would also work. It matches everything up to and including hyphen, space (i.e. ".*- "), then captures everything until # (i.e. "(.*)#"), followed by whatever is left (".*"), and replaces the whole match with the capture group, i.e. the part within parentheses. It also uses no packages.
sub(".*- (.*)#.*", "\\1", i)
## [1] "TEST_WF1_CORP[-application-com.ibm.ws.runtime.WsServer]"
Note: We used this as input i:
i <- "Current CPU load - TEST_WF1_CORP[-application-com.ibm.ws.runtime.WsServer]#example1.com"
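Both approaches are vectorised, and both leave a string untouched when the pattern does not match, which is worth checking for before trusting the result. A minimal sketch, where the second input string is a made-up example and not from the question:

```r
# Input from the question plus a hypothetical string with neither " - " nor "#".
inputs <- c(
  "Current CPU load - TEST_WF1_CORP[-application-com.ibm.ws.runtime.WsServer]#example1.com",
  "no delimiters here"
)

# 1) gsub: delete the part before the target and the part after it.
by_gsub <- gsub(".* - |#.*", "", inputs)

# 2) sub: match the whole string and keep only the capture group.
by_sub <- sub(".*- (.*)#.*", "\\1", inputs)

print(by_gsub)
print(by_sub)
```

In both results the second element comes back unchanged, so a non-matching input is not flagged in any way.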

The following should work:
extract <- unlist(strsplit(i,"- |#"))[2]
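To see why index 2 is the right one, it can help to look at all the pieces strsplit produces. A minimal sketch using the input from the question (the variable name pieces is just for illustration):

```r
i <- "Current CPU load - TEST_WF1_CORP[-application-com.ibm.ws.runtime.WsServer]#example1.com"

# Split on "hyphen + space" or on "#"; strsplit returns a list with one
# character vector per input string, so take the first list element.
pieces <- strsplit(i, "- |#")[[1]]
print(pieces)

# The text between the two delimiters is the second piece.
extract <- pieces[2]
print(extract)
```

The first piece keeps its trailing space, because only the hyphen and the space after it are part of the delimiter.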

You may use
-\s*([^#]+)
Details:
- - a hyphen
\s* - zero or more whitespaces
([^#]+) - Group 1 capturing 1 or more chars other than #.
R demo:
> library(stringr)
> i<-c("Current CPU load - TEST_WF1_CORP[-application-com.ibm.ws.runtime.WsServer]#example1.com")
> str_match(i, "-\\s*([^#]+)")[,2]
[1] "TEST_WF1_CORP[-application-com.ibm.ws.runtime.WsServer]"
The same pattern can be used with base R regmatches/regexec:
> regmatches(i, regexec("-\\s*([^#]+)", i))[[1]][2]
[1] "TEST_WF1_CORP[-application-com.ibm.ws.runtime.WsServer]"
If you prefer a replacing approach you may use a sub:
> sub(".*?-\\s*([^#]+).*", "\\1", i)
[1] "TEST_WF1_CORP[-application-com.ibm.ws.runtime.WsServer]"
Here, .*? matches any 0+ chars, as few as possible, up to the first -, then -, 0+ whitespaces (\\s*), then 1+ chars other than # are captured into Group 1 (see ([^#]+)) and then .* matches the rest of the string. The \1 in the replacement pattern puts the contents of Group 1 back into the replacement result.
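As for why the original attempt returned NA: the class [^\\.] excludes dots, but the text between any hyphen and the # contains dots, so the pattern can never reach the #. A minimal base R sketch of that reasoning, using grepl with simplified versions of both patterns (the simplification and the variable names are mine, not from the question):

```r
i <- "Current CPU load - TEST_WF1_CORP[-application-com.ibm.ws.runtime.WsServer]#example1.com"

# Core of the original attempt: a hyphen, then only non-dot characters, then "#".
# Every hyphen in the string is followed by at least one dot before the "#".
attempt_matches <- grepl("-[^.]*#", i)

# Core of the fix: a hyphen, then only non-"#" characters, then "#".
fixed_matches <- grepl("-[^#]*#", i)

print(attempt_matches)
print(fixed_matches)
```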

Related

R Regex capture group?

I have a lot of strings like this:
2019/01/01/07/556662_cba3a4fc-cb8f-4150-859f-5f21a38373d0
I want to extract the substring that lies right after the last "/" and ends with "_":
556662
I have found out how to extract: /01/01/07/556662
by using the following regex: (\/)(.*?)(?=\_)
Please advise how can I capture the right group.

You may use
x <- "2019/01/01/07/556662_cba3a4fc-cb8f-4150-859f-5f21a38373d0"
regmatches(x, regexpr(".*/\\K[^_]+", x, perl=TRUE))
## [1] "556662"
Here, the regex matches and outputs the first substring that matches
.*/ - any 0+ chars as many as possible up to the last /
\K - omits this part from the match
[^_]+ - puts 1 or more chars other than _ into the match value.
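The effect of \K is easiest to see by running the same pattern with and without it. A small sketch (the variable names are just for illustration):

```r
x <- "2019/01/01/07/556662_cba3a4fc-cb8f-4150-859f-5f21a38373d0"

# With \K, everything matched before it is dropped from the reported match.
with_reset <- regmatches(x, regexpr(".*/\\K[^_]+", x, perl = TRUE))

# Without \K, the match also contains the path consumed by ".*/".
without_reset <- regmatches(x, regexpr(".*/[^_]+", x, perl = TRUE))

print(with_reset)
print(without_reset)
```

\K is a PCRE feature, so perl = TRUE is needed in both calls for a fair comparison.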
Or, a sub solution:
sub(".*/([^_]+).*", "\\1", x)
Here, it is similar to the previous one, but the 1 or more chars other than _ are captured into Group 1 (\1 in the replacement pattern) and the trailing .* makes sure the whole input is matched (and consumed, ready to be replaced).
Alternative non-base R solutions
If you can afford or prefer to work with stringi, you may use
library(stringi)
stri_match_last_regex("2019/01/01/07/556662_cba3a4fc-cb8f-4150-859f-5f21a38373d0", ".*/([^_]+)")[,2]
## [1] "556662"
This will match a string up to the last / and will capture into Group 1 (that you access in Column 2 using [,2]) 1 or more chars other than _.
Or
stri_extract_last_regex("2019/01/01/07/556662_cba3a4fc-cb8f-4150-859f-5f21a38373d0", "(?<=/)[^_/]+")
## => [1] "556662"
This will extract the last match of a string that consists of 1 or more chars other than _ and / after a /.

You could use a capturing group:
/([^_/]+)_[^/\s]*
Explanation
/ Match literally
([^_/]+) Capture in a group matching not an underscore or forward slash
_[^/\s]* Match _ and then 0+ times not a forward slash or a whitespace character
One option to get the capturing group might be to get the second column using str_match:
library(stringr)
str = c("2019/01/01/07/556662_cba3a4fc-cb8f-4150-859f-5f21a38373d0")
str_match(str, "/([^_/]+)_[^/\\s]*")[,2]
# [1] "556662"

I adapted the regex from Wiktor Stribiżew's answer so that it captures digits only.
x <- "2019/01/01/07/556662_cba3a4fc-cb8f-4150-859f-5f21a38373d0"
regmatches(x, regexpr(".*/([0-9]+)", x, perl=TRUE))
sub(".*/([0-9]+).*", "\\1", x)
Output
[1] "2019/01/01/07/556662"
[1] "556662"
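The first output line is the whole match rather than the group, because regmatches with regexpr only ever returns the full match. If the group itself is wanted without going through sub, regexec is one option. A minimal sketch, assuming R 3.3.0 or later (where regexec accepts perl = TRUE); the variable names are only illustrative:

```r
x <- "2019/01/01/07/556662_cba3a4fc-cb8f-4150-859f-5f21a38373d0"

# regexec records the position of the whole match and of each capture group.
m <- regmatches(x, regexec(".*/([0-9]+)", x, perl = TRUE))[[1]]

whole_match <- m[1]  # the full match
group_one <- m[2]    # the contents of the first capture group

print(whole_match)
print(group_one)
```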

Can't figure out why regex group is not working in str_match

I have the following code with a regex
CHARACTER <- "^([A-Z0-9 .])+(?:\\(V\\.O\\.\\))?(?:\\(O\\.S\\.\\))?(?:\\(CONT'D\\))?$"
str_match("WILL (V.O.)",CHARACTER)[1,2]
I thought this should match the value of "WILL " but it is returning blank.
I tried the RegEx in a different language and the group is coming back blank in that instance also.
What do I have to add to this regex to pull back just the value "WILL"?

You formed a repeated capturing group by placing + outside the group, so the group keeps only the last character it matched. Move the + inside the group:
CHARACTER <- "^([A-Z0-9 .]+)(?:\\(V\\.O\\.\\))?(?:\\(O\\.S\\.\\))?(?:\\(CONT'D\\))?$"
Note you may trim WILL if you use a lazy match with \s* after the group:
CHARACTER <- "^([A-Z0-9\\s.]+?)\\s*(?:\\(V\\.O\\.\\))?(?:\\(O\\.S\\.\\))?(?:\\(CONT'D\\))?$"
R demo:
> library(stringr)
> CHARACTER <- "^([A-Z0-9\\s.]+?)\\s*(?:\\(V\\.O\\.\\))?(?:\\(O\\.S\\.\\))?(?:\\(CONT'D\\))?$"
> str_match("WILL (V.O.)",CHARACTER)[1,2]
[1] "WILL"
Alternatively, you may just extract WILL with
> str_extract("WILL (V.O.)", "^.*?(?=\\s*(?:\\(V\\.O\\.\\))?(?:\\(O\\.S\\.\\))?(?:\\(CONT'D\\))?$)")
[1] "WILL"
Or the same with base R:
> x <- "WILL (V.O.)"
> regmatches(x, regexpr("^.*?(?=\\s*(?:\\(V\\.O\\.\\))?(?:\\(O\\.S\\.\\))?(?:\\(CONT'D\\))?$)", x, perl=TRUE))
[1] "WILL"
Here,
^ - matches the start of a string
.*? - any 0+ chars other than line break chars as few as possible
(?=\\s*(?:\\(V\\.O\\.\\))?(?:\\(O\\.S\\.\\))?(?:\\(CONT'D\\))?$) - makes sure that, immediately to the right of the current location, there is
\\s* - 0+ whitespaces
(?:\\(V\\.O\\.\\))? - an optional (V.O.) substring
(?:\\(O\\.S\\.\\))? - an optional (O.S.) substring
(?:\\(CONT'D\\))? - an optional (CONT'D) substring
$ - end of string.
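The difference between a repeated capturing group and a group that contains the quantifier can be checked directly in base R. A minimal sketch, assuming R 3.3.0 or later for perl = TRUE in regexec; the patterns are shortened to the (V.O.) suffix only, and the helper name group1 is hypothetical:

```r
s <- "WILL (V.O.)"

# Quantifier outside the group: the group is re-captured on every repetition,
# so only the last single character survives.
repeated <- "^([A-Z0-9 .])+(?:\\(V\\.O\\.\\))?$"

# Quantifier inside the group: the group captures the whole run.
inside <- "^([A-Z0-9 .]+)(?:\\(V\\.O\\.\\))?$"

# Hypothetical helper: return the contents of capture group 1.
group1 <- function(pattern, text) {
  regmatches(text, regexec(pattern, text, perl = TRUE))[[1]][2]
}

repeated_result <- group1(repeated, s)
inside_result <- group1(inside, s)

print(repeated_result)
print(inside_result)
```

The first result is a single space, which is why the original call looked like it returned a blank; the second still has the trailing space that the lazy variant above trims.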

Regular Expression - Extract Word in r

How can I extract MLA723950998 from this string?
"https://auto.mercadolibre.com.ar/MLA-723950998-peugeot-208-0km-16-active-plan-100-financiado-darc-_JM"
So far I have only managed to extract MLA:
gsub('.*(M\\w+).*', '\\1', "https://auto.mercadolibre.com.ar/MLA-723950998-peugeot-208-0km-16-active-plan-100-financiado-darc-_JM")
MLA

You may use
.*/(M\w+)-(\d+).*
and replace with \1\2.
Details
.*/ - any 0+ chars, as many as possible, up to and including the last / in the string
(M\w+) - Group 1 (later referred to with \1 placeholder from the replacement pattern): M and 1+ letters, digits or/and _
- - a hyphen
(\d+) - Group 2 (later referred to with \2 placeholder from the replacement pattern): one or more digits
.* - the rest of the string.
R demo:
x <- "https://auto.mercadolibre.com.ar/MLA-723950998-peugeot-208-0km-16-active-plan-100-financiado-darc-_JM"
gsub('.*/(M\\w+)-(\\d+).*', '\\1\\2', x)
# => [1] "MLA723950998"

Maybe this solution works for you:
library(stringi)
x = "https://auto.mercadolibre.com.ar/MLA-723950998-peugeot-208-0km-16-active-plan-100-financiado-darc-_JM"
stri_extract_last_regex(x, "(?<=/)([A-Za-z]+.\\d+)(?=[^/]+$)")
[1] "MLA-723950998"
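(i) The lookbehind requires a slash immediately before the match, (ii) the match itself is one or more letters, any single character, and one or more digits, (iii) and the lookahead requires that only non-slash characters follow up to the end of the string. Note that the result keeps the hyphen (MLA-723950998), so the hyphen still has to be removed to get MLA723950998.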
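If you would rather stay in base R, a similar extract-then-clean approach can be sketched with regmatches and a fixed-string sub. This is only one possible variant: the pattern is simplified (literal hyphen, no trailing lookahead) and the variable names are mine:

```r
x <- "https://auto.mercadolibre.com.ar/MLA-723950998-peugeot-208-0km-16-active-plan-100-financiado-darc-_JM"

# First substring that follows a slash and looks like letters-hyphen-digits.
id_with_hyphen <- regmatches(x, regexpr("(?<=/)[A-Za-z]+-\\d+", x, perl = TRUE))

# Drop the first hyphen; fixed = TRUE treats "-" as a plain string.
id <- sub("-", "", id_with_hyphen, fixed = TRUE)

print(id_with_hyphen)
print(id)
```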

Replace all characters except expression using gsub only

Given strings:
smple_paths <- c("/path/path/path/abc22/path/path",
                 "/apath/apath/paath/abc11/something/path")
I would like to remove all characters except the part matching abc\\d{2}.
Attempt
gsub(
  pattern = "(?!abc\\d{2})",
  replacement = "",
  x = smple_paths,
  perl = TRUE
)
# [1] "/path/path/path/abc22/path/path"
# [2] "/apath/apath/paath/abc11/something/path"
Desired results
abc22
abc11
Notes
I'm not looking for stringr::str_extract based solution or any other solution not based on gsub

If you do not care about the abc\d{2} context, you may use
sub(".*(abc\\d{2}).*", "\\1", smple_paths)
If you care about the context, you may match and capture abc + 2 digits after / and before / or end of the string, while matching any text before and after this pattern using
sub("^.*/(abc\\d{2})(?:/.*)?$", "\\1", smple_paths)
Details
^ - start of the string (not necessary here, but kept for the sake of clarity)
.* - any 0+ chars, as many as possible
/ - a / char
(abc\\d{2}) - Group 1: abc and 2 digits
(?:/.*)? - an optional (1 or 0) occurrence of a / followed with any 0+ chars as many as possible
$ - end of string.
The \1 placeholder in the replacement pattern inserts the captured text back into the result.
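The two patterns only behave differently when abc plus two digits also occurs inside a longer path component. A minimal sketch with a made-up path (not from the question); perl = TRUE is used in both calls here purely so that the behaviour is the well-documented PCRE one, whereas the answer's calls use the default engine:

```r
# Hypothetical input: "abc22" is a full path component, "abc11" is only
# part of the component "xabc11y".
tricky <- "/path/abc22/xabc11y/end"

# Context-free pattern: the greedy .* pushes the match to the last "abc" + 2 digits.
loose <- sub(".*(abc\\d{2}).*", "\\1", tricky, perl = TRUE)

# Context-aware pattern: the match must sit between "/" and "/" or end of string.
strict <- sub("^.*/(abc\\d{2})(?:/.*)?$", "\\1", tricky, perl = TRUE)

print(loose)
print(strict)
```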

R Strsplit keep delimiter in second element

I have been trying to solve this little issue for almost 2 hours, but without success. I simply want to separate a string by the delimiter: one space followed by any character. In the second element I want to keep the delimiter, whereas in the first element it shall not appear. Example:
x <- "123123 123 A123"
strsplit(x," [A-Z]")
results in:
"123123 123" "A123"
However, this does not keep the letter A in the second element.
I have tried using
strsplit(x,"(?<=[A-Z])",perl=T)
but this does not really work for my issue. It would also be okay if there is a space in the second element; it just needs the character in it.

If you want to follow your approach, you need to match 1+ whitespaces that are followed by a letter, and check the letter with a lookahead so that only the whitespaces are consumed:
> strsplit(x,"\\s+(?=[A-Z])",perl=TRUE)
[[1]]
[1] "123123 123" "A123"
Details:
\s+ - 1 or more whitespaces (put into the match value and thus will be removed during splitting)
(?=[A-Z]) - the uppercase ASCII letter must appear immediately to the right of the current location, else fail the match (the letter is not part of the match value, and will be kept in the result)
You may also match up to the last non-whitespace char followed with 1+ whitespaces and use \K match reset operator to discard the match before the whitespace:
> strsplit(x,"^.*\\S\\K\\s+",perl=T)
[[1]]
[1] "123123 123" "A123"
If the string contains line breaks, add a DOTALL flag since a dot in a PCRE regex does not match line breaks by default: "(?s)^.*\\S\\K\\s+".
Details:
^ - start of string
.* - any 0+ chars up to the last occurrence of the subsequent subpatterns (that is, \S\s+)
\\S - a non-whitespace
\\K - here, drop all the text matched so far
\\s+ - 1 or more whitespaces.
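If you need this for a whole vector and want plain character vectors back rather than a list, one alternative is to locate the delimiter with regexpr and cut with substr. This is a different technique from the strsplit calls above, sketched under the assumption that every string contains a space followed by an uppercase letter; the second input and the variable names are made up:

```r
x <- c("123123 123 A123", "34512 321 B521")

# Position of the first space that is followed by an uppercase letter.
# regexpr returns -1 where there is no match, which this sketch does not handle.
pos <- as.integer(regexpr(" (?=[A-Z])", x, perl = TRUE))

# Everything before the space, and everything after it (the letter is kept).
first <- substr(x, 1, pos - 1)
second <- substr(x, pos + 1, nchar(x))

print(first)
print(second)
```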

I would go with stringi package:
library(stringi)
x <- c("123123 123 A123", "34512 321 B521") # some modified input data
l1 <- stri_split(x, fixed=" ")
l1[[1]]
[1] "123123" "123" "A123"
Then:
lapply(l1, function(p) c(paste(p[1], p[2]), p[3]))
[[1]]
[1] "123123 123" "A123"
[[2]]
[1] "34512 321" "B521"
