Extracting part of path using sub - R

I'm attempting to extract a filename from a path in R. In a string like
someurl.com/vp/125514_45147_55144.jpg?_nc25244
I want to extract 125514_45147_55144
I'm using the following expression:
sub(".*vp/(.*?)/.*", "\\1", input)
which works but it also strips the underscores:
1255144514755144
I cannot figure out how to retain the underscores

Remove dot and everything after it of the basename:
sub("\\..*", "", basename(x))
## [1] "125514_45147_55144"
If it is possible that there are dots in the filename then use this slightly more complex pattern:
sub("(.*)\\..*", "\\1", basename(x))
## [1] "125514_45147_55144"
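The second pattern cuts at the last dot, so a dot inside the query string would leave the extension in place. A minimal sketch for that case, assuming it is acceptable to drop everything from the first ? onward: strip the query string first, then let tools::file_path_sans_ext() remove the extension. The second input below is made up for illustration.

```r
# Hypothetical inputs; the second has dots in both the file name and the query string.
x <- c("someurl.com/vp/125514_45147_55144.jpg?_nc25244",
       "someurl.com/vp/my.photo.jpg?v=1.2")

# Drop the query string (everything from the first "?" to the end).
no_query <- sub("\\?.*$", "", x)

# Take the last path component, then remove the file extension.
stem <- tools::file_path_sans_ext(basename(no_query))
stem
## [1] "125514_45147_55144" "my.photo"
```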

I suggest fixing it as
sub(".*/vp/([^/?]*?)\\.[^/?.]*(?:\\?.*)?$", "\\1", input)
Details
.* - any 0+ chars as many as possible
/vp/ - a literal substring
([^/?]*?) - Group 1 (its captured value is referenced by \1 from the replacement pattern): any 0+ chars other than / and ?, as few as possible
\\. - a dot
[^/?.]* - 0+ chars other than ., ? and /
(?:\\?.*)? - an optional substring matching ? and then any 0+ chars as many as possible
$ - end of string.
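The pieces listed above can be checked against a few variations of the input. This is only a sketch: the second and third inputs are invented for illustration, and perl = TRUE is an assumption added here so that PCRE handles the lazy quantifier and the non-capturing group.

```r
# Hypothetical inputs: with a query string, without one, and with a dot in the file name.
input <- c("someurl.com/vp/125514_45147_55144.jpg?_nc25244",
           "someurl.com/vp/125514_45147_55144.jpg",
           "someurl.com/vp/a.b.png?x=1")

pattern <- ".*/vp/([^/?]*?)\\.[^/?.]*(?:\\?.*)?$"

# Group 1 stops at the last dot before the optional query string.
result <- sub(pattern, "\\1", input, perl = TRUE)
result
## [1] "125514_45147_55144" "125514_45147_55144" "a.b"
```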
With regmatches/regexec the pattern becomes much clearer:
x <- "someurl.com/vp/125514_45147_55144.jpg?_nc25244"
regmatches(x,regexec("/vp/([^/?]*)\\.",x))[[1]][2]
## => [1] "125514_45147_55144"

stringr alternative
library( stringr )
str_match( "someurl.com/vp/125514_45147_55144.jpg?_nc25244", "^.*/(.*?)\\..*$" )[[2]]
#[1] "125514_45147_55144"

Inspired by the answer of @G.Grothendieck, a regex-free solution using dirname, basename and chartr:
x = 'someurl.com/vp/125514_45147_55144.jpg?_nc25244'
dirname(chartr(x = basename(x), ".", "/"))
# [1] "125514_45147_55144"
Assuming there is no dot in the filename.

Related

Extract all text after last occurrence of a special character

I have the string in R
BLCU142-09|Apodemia_mejicanus
and I would like to get the result
Apodemia_mejicanus
Using the stringr R package, I have tried
str_replace_all("BLCU142-09|Apodemia_mejicanus", "[[A-Z0-9|-]]", "")
# [1] "podemia_mejicanus"
which is almost what I need, except that the A is missing.
You can use
sub(".*\\|", "", x)
This will remove all text up to and including the last pipe char. Details:
.* - any zero or more chars as many as possible
\| - a | char (| is a special regex metacharacter that is an alternation operator, so it must be escaped, and since string literals in R can contain string escape sequences, the | is escaped with a double backslash).
An R example:
x <- c("BLCU142-09|Apodemia_mejicanus", "a|b|c|BLCU142-09|Apodemia_mejicanus")
sub(".*\\|", "", x)
## => [1] "Apodemia_mejicanus" "Apodemia_mejicanus"
We can match one or more characters that are not a | ([^|]+) from the start (^) of the string, followed by a |, and remove that substring with str_remove:
library(stringr)
str_remove(str1, "^[^|]+\\|")
#[1] "Apodemia_mejicanus"
If [A-Z] is also part of the character class, it matches every upper-case letter and replaces it with a blank (""), which is what happens in the OP's str_replace_all call.
data
str1 <- "BLCU142-09|Apodemia_mejicanus"
You can always choose to _extract rather than _remove:
s <- "BLCU142-09|Apodemia_mejicanus"
stringr::str_extract(s,"[[:alpha:]_]+$")
## [1] "Apodemia_mejicanus"
Depending on how permissive you want to be, you could also use [[:alpha:]]+_[[:alpha:]]+ as your target.
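For readers without stringr, the same two patterns can be tried with base R's regmatches/regexpr. This is a sketch of an equivalent, not part of the answer above; the variable names loose and strict are made up.

```r
s <- "BLCU142-09|Apodemia_mejicanus"

# End-anchored: a run of letters and underscores at the end of the string.
loose <- regmatches(s, regexpr("[[:alpha:]_]+$", s))

# Stricter: letters, a single underscore, then letters.
strict <- regmatches(s, regexpr("[[:alpha:]]+_[[:alpha:]]+", s))

loose
## [1] "Apodemia_mejicanus"
strict
## [1] "Apodemia_mejicanus"
```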
I would keep it simple (this takes everything after the first |, which is enough when the string contains only one):
substring(my_string, regexpr("|", my_string, fixed = TRUE) + 1L)
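The difference between cutting at the first pipe and cutting at the last one only shows up when a string has several pipes. A small sketch comparing the two approaches on the inputs used earlier in this thread (the names after_first and after_last are made up here):

```r
x <- c("BLCU142-09|Apodemia_mejicanus", "a|b|c|BLCU142-09|Apodemia_mejicanus")

# Everything after the FIRST pipe (fixed-string search, no regex).
after_first <- substring(x, regexpr("|", x, fixed = TRUE) + 1L)

# Everything after the LAST pipe (greedy .* runs up to the final |).
after_last <- sub(".*\\|", "", x)

after_first
## [1] "Apodemia_mejicanus"                "b|c|BLCU142-09|Apodemia_mejicanus"
after_last
## [1] "Apodemia_mejicanus" "Apodemia_mejicanus"
```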

Retain string up to second slash in regex?

I am trying to only retain the string after the first section of characters (which includes - and numerics) but before the forward slash.
I have the following string:
x <- c('/youtube.com/videos/cats', '/google.com/images/dogs', 'bbc.com/movies')
/youtube.com/videos/cats
/google.com/images/dogs
bbc.com/movies
So it would look like this
/youtube.com/
/google.com/
bbc.com/
For reference I am using R 3.6
I have tried positive lookbehinds and the closest I got was this: ^\/[^\/]*
Any help appreciated
So in the bbc.com/movies example - the string does not start with a forward slash / but I still want to be able to keep the bbc.com part during the match
You can use a sub here to only perform a single regex replacement:
sub('^(/?[^/]*/).*', '\\1', x)
Details
^ - start of string
(/?[^/]*/) - Capturing group 1 (\1 in the replacement pattern): an optional /, then 0 or more chars other than / and then a /
.* - any zero or more chars, as many as possible.
A quick test in R:
test <- c("/youtube.com/videos/cats", "/google.com/images/dogs", "bbc.com/movies")
sub('^(/?[^/]*/).*', '\\1', test)
# => [1] "/youtube.com/" "/google.com/" "bbc.com/"
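One thing to keep in mind with the sub approach: when the pattern does not match (for example, a string with no slash after the host), sub returns the input unchanged. A minimal sketch, assuming you would rather get NA in that case; the third input and the names replaced and trimmed are invented for illustration.

```r
test <- c("/youtube.com/videos/cats", "bbc.com/movies", "bbc.com")
pattern <- "^(/?[^/]*/).*"

# sub leaves non-matching strings untouched.
replaced <- sub(pattern, "\\1", test)

# Use grepl to flag non-matches and turn them into NA.
trimmed <- ifelse(grepl(pattern, test), replaced, NA_character_)

replaced
## [1] "/youtube.com/" "bbc.com/"      "bbc.com"
trimmed
## [1] "/youtube.com/" "bbc.com/"      NA
```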
First, great username. Try this: you can leverage the fact that str_extract only pulls out the first match. Assuming all URLs match letters.letters, this pattern should work. Let me know if you have numbers in any of them.
library(stringr)
c("/youtube.com/videos/cats",
"/google.com/images/dogs",
"bbc.com/movies") %>%
str_extract(., "/?\\w+\\.\\w+/")
produces
"/youtube.com/" "/google.com/" "bbc.com/"
Using base R
gsub('(\\/?.*\\.com\\/).*', '\\1', x)
[1] "/youtube.com/" "/google.com/" "bbc.com/"
an alternative would be with the rebus Package:
library(rebus)
library(stringi)
t <- c("/youtube.com/videos/cats", "/google.com/images/dogs", "bbc.com/movie")
pattern <- zero_or_more("/") %R% one_or_more(ALPHA) %R% DOT %R% one_or_more(ALPHA) %R% zero_or_more("/")
stringi::stri_extract_first_regex(t, pattern)
[1] "/youtube.com/" "/google.com/" "bbc.com/"

R Regex capture group?

I have a lot of strings like this:
2019/01/01/07/556662_cba3a4fc-cb8f-4150-859f-5f21a38373d0
I want to extract the substring that lies right after the last "/" and ends with "_":
556662
I have found out how to extract: /01/01/07/556662
by using the following regex: (\/)(.*?)(?=\_)
Please advise how can I capture the right group.
You may use
x <- "2019/01/01/07/556662_cba3a4fc-cb8f-4150-859f-5f21a38373d0"
regmatches(x, regexpr(".*/\\K[^_]+", x, perl=TRUE))
## [1] "556662"
Here, the regex matches and outputs the first substring that matches
.*/ - any 0+ chars as many as possible up to the last /
\K - omits this part from the match
[^_]+ - puts 1 or more chars other than _ into the match value.
Or, a sub solution:
sub(".*/([^_]+).*", "\\1", x)
Here, it is similar to the previous one, but the 1 or more chars other than _ are captured into Group 1 (\1 in the replacement pattern) and the trailing .* make sure the whole input is matched (and consumed, ready to be replaced).
Alternative non-base R solutions
If you can afford or prefer to work with stringi, you may use
library(stringi)
stri_match_last_regex("2019/01/01/07/556662_cba3a4fc-cb8f-4150-859f-5f21a38373d0", ".*/([^_]+)")[,2]
## [1] "556662"
This will match a string up to the last / and will capture into Group 1 (that you access in Column 2 using [,2]) 1 or more chars other than _.
Or
stri_extract_last_regex("2019/01/01/07/556662_cba3a4fc-cb8f-4150-859f-5f21a38373d0", "(?<=/)[^_/]+")
## => [1] "556662"
This will extract the last match of a string that consists of 1 or more chars other than _ and / after a /.
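Since the input looks like a path, a base R sketch that avoids matching the slashes with a regex may also be worth considering. It assumes the wanted value is always the part of the last path component before the first underscore; the names last_part and id are made up here.

```r
x <- "2019/01/01/07/556662_cba3a4fc-cb8f-4150-859f-5f21a38373d0"

# basename keeps only the text after the last "/".
last_part <- basename(x)

# Drop everything from the first underscore onward.
id <- sub("_.*", "", last_part)
id
## [1] "556662"
```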
You could use a capturing group:
/([^_/]+)_[^/\s]*
Explanation
/ Match a / literally
([^_/]+) Capture in group 1 one or more chars other than an underscore or a forward slash
_[^/\s]* Match _ and then 0+ chars other than a forward slash or a whitespace character
One option to get the capturing group might be to get the second column using str_match:
library(stringr)
str = c("2019/01/01/07/556662_cba3a4fc-cb8f-4150-859f-5f21a38373d0")
str_match(str, "/([^_/]+)_[^/\\s]*")[,2]
# [1] "556662"
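I changed the regex according to Wiktor Stribiżew's code so that it matches digits only. Note that regmatches with regexpr returns the whole match rather than the capture group, so only the sub call returns the desired value.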
x <- "2019/01/01/07/556662_cba3a4fc-cb8f-4150-859f-5f21a38373d0"
regmatches(x, regexpr(".*/([0-9]+)", x, perl=TRUE))
sub(".*/([0-9]+).*", "\\1", x)
Output
[1] "2019/01/01/07/556662"
[1] "556662"
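To get the capture group without switching to sub, regexec can be used in place of regexpr, since it also reports the positions of the groups. A short sketch with the same pattern (the name m is made up here):

```r
x <- "2019/01/01/07/556662_cba3a4fc-cb8f-4150-859f-5f21a38373d0"

# regexec returns the whole match first, then one entry per capture group.
m <- regmatches(x, regexec(".*/([0-9]+)", x))[[1]]

m[1]  # whole match
## [1] "2019/01/01/07/556662"
m[2]  # capture group 1
## [1] "556662"
```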

Replace all characters except expression using gsub only

Given strings:
smple_paths <- c("/path/path/path/abc22/path/path",
"/apath/apath/paath/abc11/something/path")
I would like to replace all characters excluding phrase abc\\d{2}
Attempt
gsub(
pattern = "(?!abc\\d{2})",
replacement = "",
x = smple_paths,
perl = TRUE
)
# [1] "/path/path/path/abc22/path/path"
# [2] "/apath/apath/paath/abc11/something/path"
Desired results
abc22
abc11
Notes
I'm not looking for stringr::str_extract based solution or any other solution not based on gsub
If you do not care about the abc\d{2} context, you may use
sub(".*(abc\\d{2}).*", "\\1", smple_paths)
If you care about the context, you may match and capture abc + 2 digits after / and before / or end of the string, while matching any text before and after this pattern using
sub("^.*/(abc\\d{2})(?:/.*)?$", "\\1", smple_paths)
Details
^ - start of the string (not necessary here, but kept for the sake of clarity)
.* - any 0+ chars, as many as possible
/ - a / char
(abc\\d{2}) - Group 1: abc and 2 digits
(?:/.*)? - an optional (1 or 0) occurrence of a / followed with any 0+ chars as many as possible
$ - end of string.
The \1 placeholder in the replacement pattern inserts the captured text back into the result.
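Because the pattern consumes the whole string, gsub gives the same result as sub here, which keeps the solution gsub-based as requested. Strings that do not contain the phrase are returned unchanged, so the sketch below adds an optional grepl check to turn them into NA. The third input is invented for illustration, and perl = TRUE is an assumption added so that PCRE handles the non-capturing group.

```r
smple_paths <- c("/path/path/path/abc22/path/path",
                 "/apath/apath/paath/abc11/something/path",
                 "/no/match/here")  # hypothetical input without the phrase

pattern <- "^.*/(abc\\d{2})(?:/.*)?$"

# Flag the strings that contain the phrase between slashes (or at the end).
hit <- grepl(pattern, smple_paths, perl = TRUE)

# Keep only group 1 for matches; use NA for everything else.
extracted <- ifelse(hit, gsub(pattern, "\\1", smple_paths, perl = TRUE), NA_character_)
extracted
## [1] "abc22" "abc11" NA
```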

extract text between certain characters in R

I need to capture TEST_WF1_CORP[-application-com.ibm.ws.runtime.WsServer] from the following string, basically from - to # sign.
i<-c("Current CPU load - TEST_WF1_CORP[-application-com.ibm.ws.runtime.WsServer]#example1.com")
I've tried this:
str_match(i, ".*-([^\\.]*)\\#.*")[,2]
I am getting NA, any ideas?
1) gsub Replace everything up to and including -, i.e. .* -, and everything after and including #, i.e. #.*, with a zero length string. No packages are needed:
gsub(".* - |#.*", "", i)
## "TEST_WF1_CORP[-application-com.ibm.ws.runtime.WsServer]"
2) sub This would also work. It matches everything up to and including minus, space (i.e. .*- ), then captures everything until # (i.e. (.*)# ), followed by whatever is left (.*), and replaces the whole match with the capture group, i.e. the part within parens. It also uses no packages.
sub(".*- (.*)#.*", "\\1", i)
## [1] "TEST_WF1_CORP[-application-com.ibm.ws.runtime.WsServer]"
Note: We used this as input i:
i <- "Current CPU load - TEST_WF1_CORP[-application-com.ibm.ws.runtime.WsServer]#example1.com"
The following should work:
extract <- unlist(strsplit(i,"- |#"))[2]
You may use
-\s*([^#]+)
Details:
- - a hyphen
\s* - zero or more whitespaces
([^#]+) - Group 1 capturing 1 or more chars other than #.
R demo:
> library(stringr)
> i<-c("Current CPU load - TEST_WF1_CORP[-application-com.ibm.ws.runtime.WsServer]#example1.com")
> str_match(i, "-\\s*([^#]+)")[,2]
[1] "TEST_WF1_CORP[-application-com.ibm.ws.runtime.WsServer]"
The same pattern can be used with base R regmatches/regexec:
> regmatches(i, regexec("-\\s*([^#]+)", i))[[1]][2]
[1] "TEST_WF1_CORP[-application-com.ibm.ws.runtime.WsServer]"
If you prefer a replacing approach you may use a sub:
> sub(".*?-\\s*([^#]+).*", "\\1", i)
[1] "TEST_WF1_CORP[-application-com.ibm.ws.runtime.WsServer]"
Here, .*? matches any 0+ chars, as few as possible, up to the first -, then -, 0+ whitespaces (\\s*), then 1+ chars other than # are captured into Group 1 (see ([^#]+)) and then .* matches the rest of the string. The \1 in the replacement pattern puts the contents of Group 1 back into the replacement result.
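The lazy .*? matters for this input because the captured text itself contains hyphens. A small sketch contrasting the lazy and greedy prefixes; perl = TRUE is an assumption added here so that PCRE handles the lazy quantifier, and the names lazy and greedy are made up.

```r
i <- "Current CPU load - TEST_WF1_CORP[-application-com.ibm.ws.runtime.WsServer]#example1.com"

# Lazy prefix: stops at the FIRST hyphen.
lazy <- sub(".*?-\\s*([^#]+).*", "\\1", i, perl = TRUE)

# Greedy prefix: runs up to the LAST hyphen that still allows a match.
greedy <- sub(".*-\\s*([^#]+).*", "\\1", i, perl = TRUE)

lazy
## [1] "TEST_WF1_CORP[-application-com.ibm.ws.runtime.WsServer]"
greedy
## [1] "com.ibm.ws.runtime.WsServer]"
```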

Resources