Replace all characters except expression using gsub only - r

Given strings:
smple_paths <- c("/path/path/path/abc22/path/path",
"/apath/apath/paath/abc11/something/path")
I would like to remove every character except the substring matching abc\\d{2}.
Attempt
gsub(
pattern = "(?!abc\\d{2})",
replacement = "",
x = smple_paths,
perl = TRUE
)
# [1] "/path/path/path/abc22/path/path"
# [2] "/apath/apath/paath/abc11/something/path"
Desired results
abc22
abc11
Notes
I'm not looking for a stringr::str_extract-based solution or any other solution not based on gsub.

If you do not care about the abc\d{2} context, you may use
sub(".*(abc\\d{2}).*", "\\1", smple_paths)
If you care about the context, you may match and capture abc + 2 digits after / and before / or end of the string, while matching any text before and after this pattern using
sub("^.*/(abc\\d{2})(?:/.*)?$", "\\1", smple_paths)
Details
^ - start of the string (not necessary here, but kept for the sake of clarity)
.* - any 0+ chars, as many as possible
/ - a / char
(abc\\d{2}) - Group 1: abc and 2 digits
(?:/.*)? - an optional (1 or 0) occurrence of a / followed by any 0+ chars, as many as possible
$ - end of string.
The \1 placeholder in the replacement pattern inserts the captured text back into the result.
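The following sketch checks why the original lookahead attempt changes nothing, and how both patterns behave under gsub, which the question asks for. The variable names (attempt, loose, strict) and the extra path in partial are illustrative assumptions, not part of the original post. perl = TRUE is set here only so that the lookahead and the non-capturing group are handled by PCRE.

```r
smple_paths <- c("/path/path/path/abc22/path/path",
                 "/apath/apath/paath/abc11/something/path")

# The lookahead is zero-width: it only matches empty positions, so
# replacing those positions with "" leaves every string unchanged.
attempt <- gsub("(?!abc\\d{2})", "", smple_paths, perl = TRUE)

# Context-free pattern: match the whole string, keep only Group 1.
loose <- gsub(".*(abc\\d{2}).*", "\\1", smple_paths)

# Context-aware pattern: abc + 2 digits must form a whole path segment.
strict <- gsub("^.*/(abc\\d{2})(?:/.*)?$", "\\1", smple_paths, perl = TRUE)

# Hypothetical path where abc22 is only part of a segment.
partial <- "/path/xabc22y/path"
loose_partial  <- gsub(".*(abc\\d{2}).*", "\\1", partial)
strict_partial <- gsub("^.*/(abc\\d{2})(?:/.*)?$", "\\1", partial, perl = TRUE)

print(loose)
print(strict)
```

On the hypothetical partial path, the context-free pattern still returns abc22, while the context-aware pattern does not match and returns the input unchanged.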

Related

R Regex capture group?

I have a lot of strings like this:
2019/01/01/07/556662_cba3a4fc-cb8f-4150-859f-5f21a38373d0
I want to extract the substring that lies right after the last "/" and ends before the "_":
556662
I have found out how to extract: /01/01/07/556662
by using the following regex: (\/)(.*?)(?=\_)
Please advise how I can capture the right group.
You may use
x <- "2019/01/01/07/556662_cba3a4fc-cb8f-4150-859f-5f21a38373d0"
regmatches(x, regexpr(".*/\\K[^_]+", x, perl=TRUE))
## [1] "556662"
Here, regexpr returns the first substring matched by the pattern:
.*/ - any 0+ chars as many as possible up to the last /
\K - discards the text matched so far from the match value
[^_]+ - puts 1 or more chars other than _ into the match value.
Or, a sub solution:
sub(".*/([^_]+).*", "\\1", x)
Here, it is similar to the previous one, but the 1 or more chars other than _ are captured into Group 1 (\1 in the replacement pattern) and the trailing .* makes sure the whole input is matched (and consumed, ready to be replaced).
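One practical difference between the two base R approaches is how they treat input that does not match. A minimal sketch, assuming a hypothetical second element without any "/":

```r
x <- c("2019/01/01/07/556662_cba3a4fc-cb8f-4150-859f-5f21a38373d0",
       "no-slash-in-this-string")  # hypothetical non-matching input

# regmatches() + regexpr() returns only the elements that matched,
# so the result can be shorter than the input.
extracted <- regmatches(x, regexpr(".*/\\K[^_]+", x, perl = TRUE))

# sub() always returns one element per input; a non-matching string
# comes back unchanged.
replaced <- sub(".*/([^_]+).*", "\\1", x)

print(extracted)
print(replaced)
```

If the result must stay aligned with the input (for example, as a data frame column), the sub form is the safer choice, provided unchanged strings are handled afterwards.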
Alternative non-base R solutions
If you can afford or prefer to work with stringi, you may use
library(stringi)
stri_match_last_regex("2019/01/01/07/556662_cba3a4fc-cb8f-4150-859f-5f21a38373d0", ".*/([^_]+)")[,2]
## [1] "556662"
This will match a string up to the last / and will capture into Group 1 (that you access in Column 2 using [,2]) 1 or more chars other than _.
Or
stri_extract_last_regex("2019/01/01/07/556662_cba3a4fc-cb8f-4150-859f-5f21a38373d0", "(?<=/)[^_/]+")
## => [1] "556662"
This will extract the last match of a string that consists of 1 or more chars other than _ and / after a /.
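The same "last match" idea can be sketched in base R without stringi: collect all matches with gregexpr and keep the final one. The variable names all_matches and last_match are illustrative.

```r
x <- "2019/01/01/07/556662_cba3a4fc-cb8f-4150-859f-5f21a38373d0"

# All runs of chars other than _ and / that directly follow a /.
all_matches <- regmatches(x, gregexpr("(?<=/)[^_/]+", x, perl = TRUE))[[1]]

# Keep only the last one, mirroring stri_extract_last_regex.
last_match <- tail(all_matches, 1)

print(all_matches)
print(last_match)
```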
You could use a capturing group:
/([^_/]+)_[^/\s]*
Explanation
/ - a literal /
([^_/]+) - Group 1: one or more chars other than an underscore or a forward slash
_[^/\s]* - an underscore, then 0+ chars other than a forward slash or a whitespace character
One option to get the capturing group might be to get the second column using str_match:
library(stringr)
str = c("2019/01/01/07/556662_cba3a4fc-cb8f-4150-859f-5f21a38373d0")
str_match(str, "/([^_/]+)_[^/\\s]*")[,2]
# [1] "556662"
I adapted the regex from Wiktor Stribiżew's code. Note that regmatches/regexpr returns the whole match, while sub returns only Group 1:
x <- "2019/01/01/07/556662_cba3a4fc-cb8f-4150-859f-5f21a38373d0"
regmatches(x, regexpr(".*/([0-9]+)", x, perl=TRUE))
sub(".*/([0-9]+).*", "\\1", x)
Output
[1] "2019/01/01/07/556662"
[1] "556662"

extracting part of path using sub

I'm attempting to extract a filename from a path in R. In a string like
someurl.com/vp/125514_45147_55144.jpg?_nc25244
I want to extract 125514_45147_55144
I'm using the following expression:
sub(".*vp/(.*?)/.*", "\\1", input)
which works but it also strips the underscores:
1255144514755144
I cannot figure out how to retain the underscores
Remove the dot and everything after it from the basename:
sub("\\..*", "", basename(x))
## [1] "125514_45147_55144"
If it is possible that there are dots in the filename then use this slightly more complex pattern:
sub("(.*)\\..*", "\\1", basename(x))
## [1] "125514_45147_55144"
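The difference between the two patterns only shows up when the filename itself contains a dot. A small sketch, assuming a hypothetical second URL (photo.v2.png) that is not in the original question:

```r
x <- c("someurl.com/vp/125514_45147_55144.jpg?_nc25244",
       "someurl.com/vp/photo.v2.png")  # hypothetical dotted filename

# Cut at the FIRST dot of the basename.
first_dot <- sub("\\..*", "", basename(x))

# Cut at the LAST dot of the basename (greedy .* inside the group).
last_dot <- sub("(.*)\\..*", "\\1", basename(x))

print(first_dot)
print(last_dot)
```

Note that the last-dot variant would also be affected by a dot inside the query string, since basename keeps everything after the final /.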
I suggest fixing it as
sub(".*/vp/([^/?]*?)\\.[^/?.]*(?:\\?.*)?$", "\\1", input)
Details
.* - any 0+ chars as many as possible
/vp/ - a literal substring
([^/?]*?) - Group 1 (its captured value is referenced by \1 from the replacement pattern): any 0+ chars other than / and ?, as few as possible
\\. - a dot
[^/?.]* - 0+ chars other than ., ? and /
(?:\\?.*)? - an optional substring matching ? and then any 0+ chars as many as possible
$ - end of string.
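As a quick check of the pattern described above, here is a sketch that runs it on the original input and on a hypothetical URL with dots in both the filename and the query string. The second input and the name stems are assumptions for illustration; perl = TRUE is set so the lazy quantifier and non-capturing group are handled by PCRE.

```r
input <- c("someurl.com/vp/125514_45147_55144.jpg?_nc25244",
           "someurl.com/vp/photo.v2.png?size=1.5")  # hypothetical

# Group 1 grows lazily until the remainder looks like
# ".extension" plus an optional "?query" up to the end of the string.
stems <- sub(".*/vp/([^/?]*?)\\.[^/?.]*(?:\\?.*)?$", "\\1", input, perl = TRUE)

print(stems)
```

Because the extension part excludes dots and the query part is matched separately, dots in the query string do not affect the captured name.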
With regmatches/regexec the pattern becomes much clearer:
x <- "someurl.com/vp/125514_45147_55144.jpg?_nc25244"
regmatches(x,regexec("/vp/([^/?]*)\\.",x))[[1]][2]
## => [1] "125514_45147_55144"
stringr alternative
library( stringr )
str_match( "someurl.com/vp/125514_45147_55144.jpg?_nc25244", "^.*/(.*?)\\..*$" )[[2]]
#[1] "125514_45147_55144"
Inspired by the answer of @G.Grothendieck, here is a regex-free solution using dirname, basename and chartr:
x = 'someurl.com/vp/125514_45147_55144.jpg?_nc25244'
dirname(chartr(".", "/", basename(x)))
# [1] "125514_45147_55144"
Assuming there is no dot in the filename.

Regular Expression - Extract Word in r

How can I extract MLA723950998 from this string?
"https://auto.mercadolibre.com.ar/MLA-723950998-peugeot-208-0km-16-active-plan-100-financiado-darc-_JM"
I only managed to extract MLA:
gsub('.*(M\\w+).*', '\\1', "https://auto.mercadolibre.com.ar/MLA-723950998-peugeot-208-0km-16-active-plan-100-financiado-darc-_JM")
MLA
You may use
.*/(M\w+)-(\d+).*
and replace with \1\2.
Details
.*/ - any 0+ chars, as many as possible, up to and including the last / in the string
(M\w+) - Group 1 (later referred to with \1 placeholder from the replacement pattern): M and 1+ letters, digits and/or _
- - a hyphen
(\d+) - Group 2 (later referred to with \2 placeholder from the replacement pattern): one or more digits
.* - the rest of the string.
See the R demo:
x <- "https://auto.mercadolibre.com.ar/MLA-723950998-peugeot-208-0km-16-active-plan-100-financiado-darc-_JM"
gsub('.*/(M\\w+)-(\\d+).*', '\\1\\2', x)
# => [1] "MLA723950998"
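The same two groups can also be read out directly instead of being substituted back. A minimal sketch using base R regexec, which returns the full match followed by each capture group; the names groups and id are illustrative.

```r
x <- "https://auto.mercadolibre.com.ar/MLA-723950998-peugeot-208-0km-16-active-plan-100-financiado-darc-_JM"

# regexec() yields the whole match first, then Group 1 and Group 2.
groups <- regmatches(x, regexec(".*/(M\\w+)-(\\d+)", x))[[1]]

# Glue Group 1 and Group 2 together, dropping the hyphen between them.
id <- paste0(groups[2], groups[3])

print(id)
```

This form makes it easy to keep the prefix and the number as separate values if both are needed later.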
Maybe this solution works for you:
library(stringi)
x = "https://auto.mercadolibre.com.ar/MLA-723950998-peugeot-208-0km-16-active-plan-100-financiado-darc-_JM"
stri_extract_last_regex(x, "(?<=/)([A-Za-z]+.\\d+)(?=[^/]+$)")
[1] "MLA-723950998"
(i) The lookbehind requires a preceding slash, (ii) which is followed by letters, exactly one arbitrary character, and digits, (iii) and the lookahead requires that only non-slash characters follow up to the end of the string. Note that the result keeps the hyphen.

In R is there a way to extract data based on the beginning and end of a pattern but not the middle data?

For example, if the following were in a single cell:
(1) Number = '1111111111, 0000000000' Text =....
(2) Number = '0000000000' Text =....
it would result in:
(1) 1111111111, 0000000000
(2) 0000000000
I tried:
x1<-str_match(x,"(?<=Number'\\s\\=\\s\\')(\\d|\\s|\\,)\\d\\'")
but that doesn't work.
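We can try str_extract_all (note that [0-9]+ also matches the digit in a leading label such as (1), so this assumes x holds the strings without those labels):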
library(stringr)
sapply(str_extract_all(x, "[0-9]+"), toString)
#[1] "1111111111, 0000000000" "0000000000"
You may use a PCRE regex to extract the numbers after Number=' from your input text:
(?:Number\s*=\s*'|\G(?!\A)\s*,\s*)\K\d+
Pattern details:
(?:Number\s*=\s*'|\G(?!\A)\s*,\s*) - either of the two alternatives:
  Number\s*=\s*' - Number, then a = surrounded by 0+ whitespace chars, then a '
  | - or
  \G(?!\A)\s*,\s* - end of the previous successful match (\G(?!\A)) and a comma surrounded by 0+ whitespace chars (\s*)
\K - omit the text matched so far
\d+ - 1+ digits (returned as a match)
See the R demo:
> x <- c("(1) Number = '1111111111, 0000000000' Text =....", "(2) Number = '0000000000' Text =....")
> regmatches(x, gregexpr("(?:Number\\s*=\\s*'|\\G(?!\\A)\\s*,\\s*)\\K\\d+", x, perl=TRUE))
[[1]]
[1] "1111111111" "0000000000"
[[2]]
[1] "0000000000"
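The question asks for one comma-separated string per element, so the list of matches still has to be collapsed. A sketch of that last step, together with a simpler single-capture alternative that assumes the numbers always sit between one pair of single quotes after Number =. The names collapsed and captured are illustrative.

```r
x <- c("(1) Number = '1111111111, 0000000000' Text =....",
       "(2) Number = '0000000000' Text =....")

# Collapse the per-element matches from the \G pattern with toString().
pattern <- "(?:Number\\s*=\\s*'|\\G(?!\\A)\\s*,\\s*)\\K\\d+"
matches <- regmatches(x, gregexpr(pattern, x, perl = TRUE))
collapsed <- vapply(matches, toString, character(1))

# Alternative (assumption: a single quoted value follows "Number ="):
# capture everything between the quotes in one go.
captured <- sub(".*Number\\s*=\\s*'([^']*)'.*", "\\1", x)

print(collapsed)
print(captured)
```

The capture-based variant returns the quoted text as-is, so it preserves whatever separators the original cell used.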

extract text between certain characters in R

I need to capture TEST_WF1_CORP[-application-com.ibm.ws.runtime.WsServer] from the following string, basically from - to # sign.
i<-c("Current CPU load - TEST_WF1_CORP[-application-com.ibm.ws.runtime.WsServer]#example1.com")
I've tried this:
str_match(i, ".*-([^\\.]*)\\#.*")[,2]
I am getting NA, any ideas?
1) gsub Replace everything up to and including -, i.e. .* -, and everything after and including #, i.e. #.*, with a zero length string. No packages are needed:
gsub(".* - |#.*", "", i)
## "TEST_WF1_CORP[-application-com.ibm.ws.runtime.WsServer]"
2) sub This would also work. It matches everything up to and including minus, space (i.e. .*- ) and then captures everything until # (i.e. (.*)# ) followed by whatever is left (.*) and replaces the whole match with the capture group, i.e. the part within parens. It also uses no packages.
sub(".*- (.*)#.*", "\\1", i)
## [1] "TEST_WF1_CORP[-application-com.ibm.ws.runtime.WsServer]"
Note: We used this as input i:
i <- "Current CPU load - TEST_WF1_CORP[-application-com.ibm.ws.runtime.WsServer]#example1.com"
The following should work:
extract <- unlist(strsplit(i,"- |#"))[2]
You may use
-\s*([^#]+)
Details:
- - a hyphen
\s* - zero or more whitespaces
([^#]+) - Group 1 capturing 1 or more chars other than #.
R demo:
> library(stringr)
> i<-c("Current CPU load - TEST_WF1_CORP[-application-com.ibm.ws.runtime.WsServer]#example1.com")
> str_match(i, "-\\s*([^#]+)")[,2]
[1] "TEST_WF1_CORP[-application-com.ibm.ws.runtime.WsServer]"
The same pattern can be used with base R regmatches/regexec:
> regmatches(i, regexec("-\\s*([^#]+)", i))[[1]][2]
[1] "TEST_WF1_CORP[-application-com.ibm.ws.runtime.WsServer]"
If you prefer a replacing approach you may use a sub:
> sub(".*?-\\s*([^#]+).*", "\\1", i)
[1] "TEST_WF1_CORP[-application-com.ibm.ws.runtime.WsServer]"
Here, .*? matches any 0+ chars, as few as possible, up to the first -, then -, 0+ whitespaces (\\s*), then 1+ chars other than # are captured into Group 1 (see ([^#]+)) and then .* matches the rest of the string. The \1 in the replacement pattern puts the contents of Group 1 back into the replacement result.
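When the input is a vector and some elements may lack the delimiters, the regexec form can be wrapped so that non-matching elements yield NA rather than the unchanged string that sub would return. A minimal sketch; extract_between is a hypothetical helper name and the second input is made up for illustration.

```r
# Hypothetical helper: text after the first "-" (and optional spaces)
# up to, but not including, the next "#"; NA when there is no match.
extract_between <- function(x) {
  m <- regmatches(x, regexec("-\\s*([^#]+)", x))
  vapply(m,
         function(g) if (length(g) >= 2) g[2] else NA_character_,
         character(1))
}

i <- c("Current CPU load - TEST_WF1_CORP[-application-com.ibm.ws.runtime.WsServer]#example1.com",
       "no delimiters here")  # hypothetical non-matching input

result <- extract_between(i)
print(result)
```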

Resources