Retain string up to second slash in regex? - r

I am trying to only retain the string after the first section of characters (which includes - and numerics) but before the forward slash.
I have the following string:
x <- c('/youtube.com/videos/cats', '/google.com/images/dogs', 'bbc.com/movies')
/youtube.com/videos/cats
/google.com/images/dogs
bbc.com/movies
So it would look like this
/youtube.com/
/google.com/
bbc.com/
For reference I am using R 3.6
I have tried positive lookbehinds and the closest I got was this: ^\/[^\/]*
Any help appreciated
So in the bbc.com/movies example - the string does not start with a forward slash / but I still want to be able to keep the bbc.com part during the match

You can use a sub here to only perform a single regex replacement:
sub('^(/?[^/]*/).*', '\\1', x)
See the regex demo.
Details
^ - start of string
(/?[^/]*/) - Capturing group 1 (\1 in the replacement pattern): an optional /, then 0 or more chars other than / and then a /
.* - any zero or more chars, as many as possible.
See an R test online:
test <- c("/youtube.com/videos/cats", "/google.com/images/dogs", "bbc.com/movies")
sub('^(/?[^/]*/).*', '\\1', test)
# => [1] "/youtube.com/" "/google.com/" "bbc.com/"
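As a small sketch of how this behaves on input that has no slash at all (the third string below is a made-up example, not from the question): sub() returns a non-matching string unchanged, whereas regmatches() with regexpr() silently drops it. Which behavior you want depends on your data.

```r
# The last element is a hypothetical input without any slash.
x <- c("/youtube.com/videos/cats", "bbc.com/movies", "no-slash-here")

# sub() leaves a string untouched when the pattern does not match.
kept <- sub("^(/?[^/]*/).*", "\\1", x)

# regmatches() + regexpr() returns only the elements that matched.
matched <- regmatches(x, regexpr("^/?[^/]*/", x))

print(kept)
print(matched)
```

If the lengths must line up with the input, the sub() form is usually the safer choice.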

First, great username. Try this: you can leverage the fact that str_extract only pulls out the first match. Assuming all URLs look like word.word, this pattern should work (\\w covers letters, digits and underscores, but not hyphens). Let me know if any of them contain other characters.
library(stringr)
c("/youtube.com/videos/cats",
"/google.com/images/dogs",
"bbc.com/movies") %>%
str_extract(., "/?\\w+\\.\\w+/")
produces
"/youtube.com/" "/google.com/" "bbc.com/"

Using base R
gsub('(\\/?.*\\.com\\/).*', '\\1', x)
[1] "/youtube.com/" "/google.com/" "bbc.com/"

An alternative is the rebus package:
library(rebus)
library(stringi)
t <- c("/youtube.com/videos/cats", "/google.com/images/dogs", "bbc.com/movie")
pattern <- zero_or_more("/") %R% one_or_more(ALPHA) %R% DOT %R% one_or_more(ALPHA) %R% zero_or_more("/")
stringi::stri_extract_first_regex(t, pattern)
[1] "/youtube.com/" "/google.com/" "bbc.com/"

Related

Extract all text after last occurrence of a special character

I have the string in R
BLCU142-09|Apodemia_mejicanus
and I would like to get the result
Apodemia_mejicanus
Using the stringr R package, I have tried
str_replace_all("BLCU142-09|Apodemia_mejicanus", "[[A-Z0-9|-]]", "")
# [1] "podemia_mejicanus"
which is almost what I need, except that the A is missing.
You can use
sub(".*\\|", "", x)
This will remove all text up to and including the last pipe char. See the regex demo. Details:
.* - any zero or more chars as many as possible
\| - a | char (| is a special regex metacharacter that is an alternation operator, so it must be escaped, and since string literals in R can contain string escape sequences, the | is escaped with a double backslash).
See the R demo online:
x <- c("BLCU142-09|Apodemia_mejicanus", "a|b|c|BLCU142-09|Apodemia_mejicanus")
sub(".*\\|", "", x)
## => [1] "Apodemia_mejicanus" "Apodemia_mejicanus"
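To illustrate why the pipe has to be escaped, here is a minimal sketch comparing an unescaped pattern with two ways of making the | literal. The variable names are only for illustration, and perl = TRUE is used for the unescaped case so that the regex flavor is unambiguous.

```r
x <- "a|b|c|BLCU142-09|Apodemia_mejicanus"

# Unescaped: "|" is alternation, so the first branch ".*" matches the whole
# string and everything is removed.
unescaped <- sub(".*|", "", x, perl = TRUE)

# Escaped with a double backslash, as in the answer above.
escaped <- sub(".*\\|", "", x)

# A bracket expression also makes the pipe literal.
bracketed <- sub(".*[|]", "", x)

print(unescaped)
print(escaped)
print(bracketed)
```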
We can match one or more characters that are not a | ([^|]+) from the start (^) of the string followed by | in str_remove to remove that substring
library(stringr)
str_remove(str1, "^[^|]+\\|")
#[1] "Apodemia_mejicanus"
If we also include [A-Z] in the character class, it matches every upper-case letter and replaces it with blank (""), which is why the A went missing in the OP's str_replace_all
data
str1 <- "BLCU142-09|Apodemia_mejicanus"
You can always choose to _extract rather than _remove:
s <- "BLCU142-09|Apodemia_mejicanus"
stringr::str_extract(s,"[[:alpha:]_]+$")
## [1] "Apodemia_mejicanus"
Depending on how permissive you want to be, you could also use [[:alpha:]]+_[[:alpha:]]+ as your target.
I would keep it simple. Note that regexpr() finds the first |, which is enough when the string contains only one:
substring(my_string, regexpr("|", my_string, fixed = TRUE) + 1L)
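A short sketch of the difference between the first and the last pipe, still without regular expressions. The second input string and the variable names are assumptions added for illustration; gregexpr() with fixed = TRUE reports every pipe position, and the last one is used as the cut point.

```r
# "a|b|c" is a made-up input with more than one pipe.
s <- c("BLCU142-09|Apodemia_mejicanus", "a|b|c")

# Text after the FIRST pipe.
first_tail <- substring(s, regexpr("|", s, fixed = TRUE) + 1L)

# Text after the LAST pipe: take the final position reported by gregexpr().
last_pos <- vapply(gregexpr("|", s, fixed = TRUE),
                   function(p) p[length(p)], integer(1))
last_tail <- substring(s, last_pos + 1L)

print(first_tail)
print(last_tail)
```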

R Regex capture group?

I have a lot of strings like this:
2019/01/01/07/556662_cba3a4fc-cb8f-4150-859f-5f21a38373d0
I want to extract the substring that lies right after the last "/" and ends with "_":
556662
I have found out how to extract: /01/01/07/556662
by using the following regex: (\/)(.*?)(?=\_)
Please advise how can I capture the right group.
You may use
x <- "2019/01/01/07/556662_cba3a4fc-cb8f-4150-859f-5f21a38373d0"
regmatches(x, regexpr(".*/\\K[^_]+", x, perl=TRUE))
## [1] "556662"
See the regex and R demo.
Here, the regex matches and outputs the first substring that matches
.*/ - any 0+ chars as many as possible up to the last /
\K - match reset operator: discards the text matched so far from the match value
[^_]+ - puts 1 or more chars other than _ into the match value.
Or, a sub solution:
sub(".*/([^_]+).*", "\\1", x)
See the regex demo.
Here, it is similar to the previous one, but the 1 or more chars other than _ are captured into Group 1 (\1 in the replacement pattern) and the trailing .* makes sure the whole input is matched (and consumed, ready to be replaced).
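As a rough cross-check, the two base R patterns can be compared on a vector, together with a variant that relies on basename() instead of matching up to the last slash. The second path and the variable names are hypothetical and only serve to show that the calls are vectorized.

```r
# The second element is a made-up path used for illustration.
x <- c("2019/01/01/07/556662_cba3a4fc-cb8f-4150-859f-5f21a38373d0",
       "2020/12/31/23/42_hypothetical-id")

# \K variant: needs perl = TRUE.
ids_k <- regmatches(x, regexpr(".*/\\K[^_]+", x, perl = TRUE))

# Capturing-group variant with sub().
ids_sub <- sub(".*/([^_]+).*", "\\1", x)

# basename() keeps the part after the last slash; then drop "_" and the rest.
ids_base <- sub("_.*", "", basename(x))

print(ids_k)
print(ids_sub)
print(ids_base)
```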
Alternative non-base R solutions
If you can afford or prefer to work with stringi, you may use
library(stringi)
stri_match_last_regex("2019/01/01/07/556662_cba3a4fc-cb8f-4150-859f-5f21a38373d0", ".*/([^_]+)")[,2]
## [1] "556662"
This will match a string up to the last / and will capture into Group 1 (that you access in Column 2 using [,2]) 1 or more chars other than _.
Or
stri_extract_last_regex("2019/01/01/07/556662_cba3a4fc-cb8f-4150-859f-5f21a38373d0", "(?<=/)[^_/]+")
## => [1] "556662"
This will extract the last match of a string that consists of 1 or more chars other than _ and / after a /.
You could use a capturing group:
/([^_/]+)_[^/\s]*
Explanation
/ Match literally
([^_/]+) Capture in a group matching not an underscore or forward slash
_[^/\s]* Match _ and then 0+ times not a forward slash or a whitespace character
One option to get the capturing group might be to get the second column using str_match:
library(stringr)
str = c("2019/01/01/07/556662_cba3a4fc-cb8f-4150-859f-5f21a38373d0")
str_match(str, "/([^_/]+)_[^/\\s]*")[,2]
# [1] "556662"
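If stringr is not available, the same capturing group can probably be read out in base R with regexec(). This is only a sketch: perl = TRUE is an assumption made here so that \s inside the bracket expression is treated as a whitespace class, and the variable names are illustrative.

```r
x <- "2019/01/01/07/556662_cba3a4fc-cb8f-4150-859f-5f21a38373d0"

# regexec() reports the full match plus each capturing group.
m <- regmatches(x, regexec("/([^_/]+)_[^/\\s]*", x, perl = TRUE))

# Element 1 is the whole match, element 2 is Group 1.
group1 <- m[[1]][2]
print(group1)
```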
I changed the Regex rules according to the code of Wiktor Stribiżew.
x <- "2019/01/01/07/556662_cba3a4fc-cb8f-4150-859f-5f21a38373d0"
regmatches(x, regexpr(".*/([0-9]+)", x, perl=TRUE))
sub(".*/([0-9]+).*", "\\1", x)
Output
[1] "2019/01/01/07/556662"
[1] "556662"

extracting part of path using sub

I'm attempting to extract a filename from a path in r. In a string like
someurl.com/vp/125514_45147_55144.jpg?_nc25244
I want to extract 125514_45147_55144
I'm using the following expression:
sub(".*vp/(.*?)/.*", "\\1", input)
which works but it also strips the underscores:
1255144514755144
I cannot figure out how to retain the underscores
Remove the dot and everything after it from the basename:
sub("\\..*", "", basename(x))
## [1] "125514_45147_55144"
If it is possible that there are dots in the filename then use this slightly more complex pattern:
sub("(.*)\\..*", "\\1", basename(x))
## [1] "125514_45147_55144"
I suggest fixing it as
sub(".*/vp/([^/?]*?)\\.[^/?.]*(?:\\?.*)?$", "\\1", input)
See the regex demo
Details
.* - any 0+ chars as many as possible
/vp/ - a literal substring
([^/?]*?) - Group 1 (its captured value is referenced by \1 from the replacement pattern): any 0+ chars other than / and ?, as few as possible
\\. - a dot
[^/?.]* - 0+ chars other than ., ? and /
(?:\\?.*)? - an optional substring matching ? and then any 0+ chars as many as possible
$ - end of string.
With regmatches/regexec the pattern becomes much clearer:
x <- "someurl.com/vp/125514_45147_55144.jpg?_nc25244"
regmatches(x,regexec("/vp/([^/?]*)\\.",x))[[1]][2]
## => [1] "125514_45147_55144"
See the R demo
stringr alternative
library( stringr )
str_match( "someurl.com/vp/125514_45147_55144.jpg?_nc25244", "^.*/(.*?)\\..*$" )[[2]]
#[1] "125514_45147_55144"
Inspired by the answer of @G.Grothendieck, a regex-free solution using dirname, basename and chartr:
x = 'someurl.com/vp/125514_45147_55144.jpg?_nc25244'
dirname(chartr(".", "/", basename(x)))
# [1] "125514_45147_55144"
Assuming there is no dot in the filename.
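When the filename itself may contain dots, one possible variant of the basename-based answers above is to strip the query string first and then let tools::file_path_sans_ext() remove only the final extension. This is a sketch under that assumption; the second URL is made up for illustration.

```r
# The second element is a hypothetical URL whose filename contains dots.
x <- c("someurl.com/vp/125514_45147_55144.jpg?_nc25244",
       "someurl.com/vp/my.photo.v2.jpg?_nc1")

# Drop "?" and everything after it, so the extension sits at the end.
no_query <- sub("\\?.*$", "", x)

# basename() keeps the filename; file_path_sans_ext() drops the last extension.
stems <- tools::file_path_sans_ext(basename(no_query))
print(stems)
```

The tools package ships with R, so no extra installation should be needed.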

How to keep only information inside a complex string in R?

I want to keep a substring from inside a complex string, and I think I can use regex to keep the part I need. Basically, I want to keep only the information between the \" and \" in Function=\"SMAD5\". I also want to keep the empty strings: Function=\"\"
df=structure(1:6, .Label = c("ID=Gfo_R000001;Source=ENST00000513418;Function=\"SMAD5\";",
"ID=Gfo_R000002;Source=ENSTGUT00000017468;Function=\"CENPA\";",
"ID=Gfo_R000003;Source=ENSGALT00000028134;Function=\"C1QL4\";",
"ID=Gfo_R000004;Source=ENSTGUT00000015300;Function=\"\";", "ID=Gfo_R000005;Source=ENSTGUT00000019268;Function=\"\";",
"ID=Gfo_R000006;Source=ENSTGUT00000019035;Function=\"\";"), class = "factor")
This should look like this:
"SMAD5"
"CENPA"
"C1QL4"
NA
NA
NA
So far, this is what I was able to do:
gsub('.*Function=\"',"",df)
[1] "SMAD5\";" "CENPA\";" "C1QL4\";" "\";" "\";" "\";"
But I'm stuck with a bunch of \";". How can I remove them with one line?
I tried this:
gsub('.*Function=\"' & '.\"*',"",test)
But it's giving me this error:
Error in ".*Function=\"" & ".\"*" :
operations are possible only for numeric, logical or complex types
You may use
gsub(".*Function=\"([^\"]*).*","\\1",df)
See the regex demo
Details:
.* - any 0+ chars as many as possible up to the last...
Function=\" - a Function=" substring
([^\"]*) - capturing group 1 matching 0+ chars other than a "
.* - and the rest of the string.
The \1 is the backreference restoring the contents of the Group 1 in the result.
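The question asks for NA where Function is empty, while the gsub call above returns "" in those cases. A minimal base R sketch of the extra step follows, using a shortened two-element version of the data; the shortening and the variable name out are assumptions made for brevity.

```r
# Shortened version of the question's factor, for illustration only.
df <- factor(c('ID=Gfo_R000001;Source=ENST00000513418;Function="SMAD5";',
               'ID=Gfo_R000004;Source=ENSTGUT00000015300;Function="";'))

# Same pattern as above; as.character() makes the factor handling explicit.
out <- sub('.*Function="([^"]*).*', "\\1", as.character(df))

# Turn empty strings into NA, as requested in the question.
out[!nzchar(out)] <- NA
print(out)
```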
With stringr we can capture groups too:
library(stringr)
matches <- str_match(df, ".*\"(.*)\".*")[,2]
ifelse(matches=='', NA, matches)
# [1] "SMAD5" "CENPA" "C1QL4" NA NA NA
The regular expression can be constructed more readably using rebus.
library(rebus)
rx <- 'Function="' %R%
capture(zero_or_more(negated_char_class('"')))
Then matching works as in the answers by Wiktor and sandipan.
library(stringr)
library(stringi)
str_match(df, rx)
stri_match_first_regex(df, rx)
gsub(any_char(0, Inf) %R% rx %R% any_char(0, Inf), REF1, df)

how do you extract values between two characters in R?

I am trying to extract the server name (server101) from this string in R using regular expression:
value between # and the following first period (.)
t<-c("Current CPU load - jvm machine[example network-app_svc_group_mem4]#server101.example.com")
I've tried this:
gsub('.*\\#(\\d+),(\\d+).*', '\\1', t)
this does not seem to be working, any ideas?
Since you only expect one match, you may use a simple sub here:
t <- "Current CPU load - jvm machine[example network-app_svc_group_mem4]#server101.example.com"
sub(".*#([^.]+)\\..*", "\\1", t)
## => [1] "server101"
See the R demo online.
Details
.* - any 0+ chars, as many as possible
# - a # char
([^.]+) - Group 1 ("\\1"): 1 or more chars other than a dot
\\. - a dot (other chars you need to escape are $, ^, *, (, ), +, [, \, ?)
.* - any 0+ chars, as many as possible
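One thing to keep in mind is that sub() returns the input unchanged when there is no # followed by a dot. The sketch below wraps regexec() in a small helper that returns NA instead; the helper name extract_after_hash and the second input string are hypothetical.

```r
# Hypothetical helper: text between "#" and the next ".", or NA if absent.
extract_after_hash <- function(s) {
  m <- regmatches(s, regexec("#([^.]+)\\.", s))
  # Each list element is c(full match, group 1), or character(0) on no match.
  vapply(m, function(g) if (length(g) >= 2) g[2] else NA_character_,
         character(1))
}

inputs <- c("Current CPU load - jvm machine[example network-app_svc_group_mem4]#server101.example.com",
            "no marker here")
servers <- extract_after_hash(inputs)
print(servers)
```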
Here are some alternatives.
You may use the following base R code to extract 1+ characters other than . ([^.]+) after the first #:
> t <- "Current CPU load - jvm machine[example network-app_svc_group_mem4]#server101.example.com"
> pattern="#([^.]+)"
> m <- regmatches(t,regexec(pattern,t))
> result = unlist(m)[2]
> result
[1] "server101"
With regexec, you can access submatches (capturing group contents).
See the online R demo
Another way is to use regmatches/regexpr with a PCRE regex with a (?<=#) lookbehind that only checks for the character presence, but does not put the character into the match:
> result2 <- regmatches(t, regexpr("(?<=#)[^.]+", t, perl=TRUE))
> result2
[1] "server101"
A clean stringr approach is to use the same regex with str_extract. stringr uses the ICU regex flavor, which is similar to PCRE in that it also supports lookarounds:
> library(stringr)
> t<-c("Current CPU load - jvm machine[example network-app_svc_group_mem4]#server101.example.com")
> str_extract(t, "(?<=#)[^.]+")
[1] "server101"
with stringr:
library(stringr)
str_match(t, ".*#([^\\.]*)\\..*")[2]
#[1] "server101"
