Regex match after last / and first underscore - r

Assuming I have the following string:
string = "path/stack/over_flow/Pedro_account"
I am interested in matching the first 2 characters after the last / and before the first _. So in this case the desired output is:
Pe
What I have so far is a mix of substr and str_extract:
substr(str_extract(string, "[^/]*$"),1,2)
which of course gives an answer, but I believe there is a nice regex for it as well, and that is what I'm looking for.

You can use
library(stringr)
str_extract(string, "(?<=/)[^/]{2}(?=[^/]*$)")
## => [1] "Pe"
See the R demo and the regex demo. Details:
(?<=/) - a location immediately preceded with a / char
[^/]{2} - two chars other than /
(?=[^/]*$) - a location immediately followed by zero or more chars other than / up to the end of string.
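If loading stringr is not an option, the same lookaround pattern should also work in base R, provided perl = TRUE is set so that the PCRE engine handles the lookbehind and lookahead. A minimal sketch; the variable names m and result are only illustrative:

```r
string <- "path/stack/over_flow/Pedro_account"

# regexpr() locates the first match; perl = TRUE enables lookarounds
m <- regexpr("(?<=/)[^/]{2}(?=[^/]*$)", string, perl = TRUE)

# regmatches() pulls out the matched text
result <- regmatches(string, m)
result
## => [1] "Pe"
```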

Using basename to get the last folder name, then substring:
substr(basename("path/stack/over_flow/Pedro_account"), 1, 2)
# [1] "Pe"

Remove everything till last / and extract first 2 characters.
Base R -
string = "path/stack/over_flow/Pedro_account"
substr(sub('.*/', '', string), 1, 2)
#[1] "Pe"
stringr
substr(stringr::str_remove(string, '.*/'), 1, 2)

You can use str_match with a capture group:
/ Match literally
([^/_]{2}) Capture 2 chars other than / or _ in group 1
[^/]* Match optional chars other than /
$ End of string
See a regex demo and an R demo.
Example
library(stringr)
string = "path/stack/over_flow/Pedro_account"
str_match(string, "/([^/_]{2})[^/]*$")[,2]
Output
[1] "Pe"
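The same capture-group pattern can be tried without stringr. The sketch below assumes base R's regexec and regmatches, which return the whole match followed by each capture group; the names parts and result are illustrative only:

```r
string <- "path/stack/over_flow/Pedro_account"

# regexec() records the full match plus the position of each capture group
parts <- regmatches(string, regexec("/([^/_]{2})[^/]*$", string))

# Element 1 is the whole match, element 2 is capture group 1
result <- parts[[1]][2]
result
## => [1] "Pe"
```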

Related

Remove all punctuation except underline between characters in R with POSIX character class

I would like to use R to remove all underlines except those between words. In the end, the code should remove underlines at the end or at the beginning of a word.
The result should be
'hello_world and hello_world'.
I want to use the pre-built POSIX character classes. Right now I have learned to exclude particular characters with the following code, but I don't know how to use the word boundary sequences.
test<-"hello_world and _hello_world_"
gsub("[^_[:^punct:]]", "", test, perl=T)
You can use
gsub("[^_[:^punct:]]|_+\\b|\\b_+", "", test, perl=TRUE)
See the regex demo
Details:
[^_[:^punct:]] - any punctuation except _
| - or
_+\b - one or more _ at the end of a word
| - or
\b_+ - one or more _ at the start of a word
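To see the three alternatives working together, the pattern can be run on an input that also contains other punctuation. The string test2 below is a made-up example, not part of the question:

```r
# Hypothetical input with extra punctuation around the words
test2 <- "_hello_world_, and: (hello_world)!"

# Alternative 1 drops punctuation other than _,
# alternatives 2 and 3 drop underscores at word edges
cleaned <- gsub("[^_[:^punct:]]|_+\\b|\\b_+", "", test2, perl = TRUE)
cleaned
## => [1] "hello_world and hello_world"
```

The underscore between hello and world survives because _ is itself a word character, so there is no word boundary on either side of it.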
One non-regex way is to split and use trimws by setting the whitespace argument to _, i.e.
paste(sapply(strsplit(test, ' '), function(i)trimws(i, whitespace = '_')), collapse = ' ')
#[1] "hello_world and hello_world"
We can remove every underscore that has a word boundary on either side. We use a positive lookbehind and lookahead to find such underscores. To remove underscores at the start and end of the string we use trimws.
test<-"hello_world and _hello_world_"
gsub("(?<=\\b)_|_(?=\\b)", "", trimws(test, whitespace = '_'), perl = TRUE)
#[1] "hello_world and hello_world"
You could use:
test <- "hello_world and _hello_world_"
output <- gsub("(?<![^\\W])_|_(?![^\\W])", "", test, perl=TRUE)
output
[1] "hello_world and hello_world"
Explanation of regex:
(?<![^\\W]) assert that what precedes is a non word character OR the start of the input
_ match an underscore to remove
| OR
_ match an underscore to remove, followed by
(?![^\\W]) assert that what follows is a non word character OR the end of the input

remove all characters between string and bracket in R

Say I have a dataframe df in which a column df$strings contains strings like
[cat 00.04;09]
[cat 00.04;10]
and so on. I want to remove all characters between "[cat" and "]" to yield
[cat]
[cat]
I've tried this using gsub but it's not working and I'm not sure what I'm doing wrong:
gsub('cat*?\\]', '', df)
Note that the cat*?\\] pattern matches ca, then any 0+ t chars but as few as possible, and then ].
You want to match any chars other than ] between [cat and ]:
gsub('\\[cat[^]]*\\]', '[cat]', df$strings)
Here,
\\[ - matches [
cat - matches cat
[^]]* - 0+ chars other than ] (note that ] inside the bracket expression should not be escaped when placed at the start; if you do escape it, you will need to add the perl=TRUE argument, since the PCRE regex engine can handle regex escapes inside bracket expressions while the default TRE engine cannot)
\\] - a ] (you do not even need to escape it, you may just use ]).
See the R demo:
x <- c("[cat 00.04;09]", "[cat 00.04;10]")
gsub('\\[cat[^]]*\\]', '[cat]', x)
## => [1] "[cat]" "[cat]"
If cat can be any word, use
gsub('\\[(\\w+)[^]]*\\]', '[\\1]', x)
where (\\w+) is a capturing group with ID=1 that matches 1 or more word chars, and \\1 in the replacement pattern is a replacement backreference that stands for the group value.
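Since the question works with a data frame column, here is a minimal sketch of applying the replacement to df$strings and assigning the result back. The data frame is a made-up stand-in for the one in the question:

```r
# Hypothetical data frame mirroring the df$strings column from the question
df <- data.frame(
  strings = c("[cat 00.04;09]", "[dog 00.04;10]"),
  stringsAsFactors = FALSE
)

# Keep the first word after "[" and drop everything up to the closing "]"
df$strings <- gsub("\\[(\\w+)[^]]*\\]", "[\\1]", df$strings)
df$strings
## => [1] "[cat]" "[dog]"
```

Passing the column rather than the whole df matters here: gsub works on character vectors, so it should be given df$strings, not df.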

Regex: Extracting numbers from parentheses with multiple matches

How do I match the year such that it is general for the following examples.
a <- '"You Are There" (1953) {The Death of Socrates (399 B.C.) (#1.14)}'
b <- 'Þegar það gerist (1998/I) (TV)'
I have tried the following, but did not have the biggest success.
gsub('.+\\(([0-9]+.+\\)).?$', '\\1', a)
What I thought it did was to go until it finds a (, then it would make a group of numbers, then any character until it meets a ). And if there are several matches, I want to extract the first group.
Any suggestions to where I go wrong? I have been doing this in R.
You could use
library(stringr)
strings <- c('"You Are There" (1953) {The Death of Socrates (399 B.C.) (#1.14)}', 'Þegar það gerist (1998/I) (TV)')
years <- str_match(strings, "\\((\\d+(?: B\\.C\\.)?)")[,2]
years
# [1] "1953" "1998"
The expression here is
\( # (
(\d+ # capture 1+ digits
(?: B\.C\.)? # optionally followed by " B.C."
)
Note that backslashes need to be escaped in R.
Your pattern contains .+ parts that match 1 or more chars, as many as possible, so at best it could grab the last 4-digit chunks from the incoming strings.
You may use
^.*?\((\d{4})(?:/[^)]*)?\).*
Replace with \1 to only keep the 4 digit number. See the regex demo.
Details
^ - start of string
.*? - any 0+ chars as few as possible
\( - a (
(\d{4}) - Group 1: four digits
(?: - start of an optional non-capturing group
/ - a /
[^)]* - any 0+ chars other than )
)? - end of the group
\) - a ) (optional: it may be omitted, but then a 4-digit number followed by other text, as in (1725 Version), will match as well)
.* - the rest of the string.
See the R demo:
a <- c('"You Are There" (1953) {The Death of Socrates (399 B.C.) (#1.14)}', 'Þegar það gerist (1998/I) (TV)', 'Johannes Passion, BWV. 245 (1725 Version) (1996) (V)')
sub("^.*?\\((\\d{4})(?:/[^)]*)?\\).*", "\\1", a)
# => [1] "1953" "1998" "1996"
Another base R solution is to match the 4 digits after (:
regmatches(a, regexpr("\\(\\K\\d{4}(?=(?:/[^)]*)?\\))", a, perl=TRUE))
# => [1] "1953" "1998" "1996"
The \(\K\d{4} pattern matches ( and then drops it from the match due to the \K match reset operator, and then the (?=(?:/[^)]*)?\)) lookahead ensures there is an optional / + 0+ chars other than ) and then a ). Note that regexpr extracts the first match only.
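One thing to watch with regmatches: elements without a match are dropped from the result, so the output can be shorter than the input. A possible way to keep the positions aligned is to pre-fill a vector with NA, sketched below. The second title is invented for illustration, and the names titles and years are not from the answer:

```r
titles <- c('"You Are There" (1953) {The Death of Socrates (399 B.C.) (#1.14)}',
            "A title without a year (TV)")  # hypothetical entry with no year

m <- regexpr("\\(\\K\\d{4}(?=(?:/[^)]*)?\\))", titles, perl = TRUE)

# regexpr() returns -1 where nothing matched; fill only the matched slots
years <- rep(NA_character_, length(titles))
years[m != -1] <- regmatches(titles, m)
years
## => [1] "1953" NA
```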

replace last number in string using regex

I want to replace the last number in a string using regex and gsub
S <- "abcd2efghi2.txt"
The last number and the position of the last number can vary.
So I've tried the regex
?<=[\d+])\b
gsub("?<=[\d+])\b", "", S)
but that doesn't seem to work
Appreciate any help.
You can achieve that with a default TRE engine using the following regex:
\d+(\D*)$
Replace with the \1 backreference.
Details
\d+ - 1 or more digits
(\D*) - Capturing group 1: any 0+ non-digit symbols
$ - end of string
\1 - a backreference to the Group 1 value (so as to restore the text matched and consumed with the (\D*) subpattern).
See the regex demo.
R code demo:
sub("\\d+(\\D*)$", "\\1", S)
## => [1] "abcd2efghi.txt"
You could use this regex:
\d+(?=\D*$)
It matches a sequence of digits when everything that follows consists of non-digits (\D) until the end of the string ($).
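In R this lookahead variant needs perl = TRUE, because lookarounds are handled by the PCRE engine rather than the default TRE one. A short sketch; the replacement value "9" is an arbitrary example:

```r
S <- "abcd2efghi2.txt"

# Remove the last run of digits; no backreference is needed,
# because the lookahead does not consume the trailing text
removed <- sub("\\d+(?=\\D*$)", "", S, perl = TRUE)
removed
## => [1] "abcd2efghi.txt"

# Replace the last run of digits instead of deleting it
replaced <- sub("\\d+(?=\\D*$)", "9", S, perl = TRUE)
replaced
## => [1] "abcd2efghi9.txt"
```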

extract string from in R using regex

I have this vector:
jvm<-c("test - PROD_DB_APP_185b#SERVER01" ,"uat - PROD_DB_APP_SYS[1]#SERVER2")
I need to extract text until "[" or if there is no "[", then until the "#" character.
result should be
PROD_DB_APP_185b
PROD_DB_APP_SYS
I've tried something like this:
str_match(jvm, ".*\\-([^\\.]*)([.*)|(#.*)")
not working, any ideas?
A sub solution with base R:
jvm<-c("test - PROD_DB_APP_185b#SERVER01" ,"uat - PROD_DB_APP_SYS[1]#SERVER2")
sub("^.*?\\s+-\\s+([^#[]+).*", "\\1", jvm)
See the online R demo
Details:
^ - start of string
.*? - any 0+ chars as few as possible
\\s+-\\s+ - a hyphen enclosed with 1 or more whitespaces
([^#[]+) - capturing group 1 matching any 1 or more chars other than # and [
.* - any 0+ chars, up to the end of string.
Or a stringr solution with str_extract:
library(stringr)
str_extract(jvm, "(?<=-\\s)[^#\\[]+")
See the regex demo
Details:
(?<=-\\s) - a positive lookbehind that matches an empty string that is preceded with a - and a whitespace immediately to the left of the current location
[^#\\[]+ - 1 or more chars other than # and [.
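The lookbehind pattern can also be used from base R through regexpr and regmatches with perl = TRUE. A minimal sketch, with m and result as illustrative names:

```r
jvm <- c("test - PROD_DB_APP_185b#SERVER01", "uat - PROD_DB_APP_SYS[1]#SERVER2")

# perl = TRUE is needed for the (?<=...) lookbehind
m <- regexpr("(?<=-\\s)[^#\\[]+", jvm, perl = TRUE)
result <- regmatches(jvm, m)
result
## => [1] "PROD_DB_APP_185b" "PROD_DB_APP_SYS"
```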
