I'd like to use gsub to remove characters from a filename.
In the example below, the desired output is 23:
digs = "filepath/23-00.xlsx"
I can remove everything before 23 as follows:
gsub("^\\D+", "",digs)
[1] "23-00.xlsx"
or everything after:
gsub("\\-\\d+\\.xlsx$","", digs)
[1] "filepath/23"
How do I do both at the same time?
We could use | (OR): match any characters (.*) up to and including the last /, or (|) match the first - followed by all remaining characters (.*), and replace each match with an empty string (""):
gsub(".*/|-.*", "", digs)
[1] "23"
Or just use readr::parse_number, which returns a number rather than a string:
readr::parse_number(digs)
[1] 23
You can use a single sub call:
sub("^\\D+(\\d+).*", "\\1", digs)
# => [1] "23"
Details:
^ - start of string
\D+ - one or more non-digit chars
(\d+) - Group 1 (\1 refers to this group value): one or more digits
.* - any zero or more chars as many as possible.
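One caveat worth illustrating: sub returns its input unchanged when the pattern does not match, so a file name without digits passes through as-is. The sketch below contrasts that with pulling the match out via regexpr and regmatches. The second path and the variable names replaced, m, and extracted are made up for the illustration.

```r
# Hypothetical inputs: the second path contains no digits at all.
digs <- c("filepath/23-00.xlsx", "filepath/notes.xlsx")

# sub() leaves non-matching strings untouched.
replaced <- sub("^\\D+(\\d+).*", "\\1", digs)

# regexpr() reports -1 for "no match", so those slots can be kept as NA.
m <- regexpr("\\d+", digs)
extracted <- rep(NA_character_, length(digs))
extracted[m > 0] <- regmatches(digs, m)

print(replaced)
print(extracted)
```

Which behaviour is preferable depends on whether an unmatched file name should be flagged (NA) or passed through.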
I want to extract the number after the first underscore (_), but I don't understand why only a single digit is matched.
My sample data is:
myvec<-c("increa_0_1-1","increa_9_25-112","increa_25-50-76" )
as.numeric(gsub("(.*_){1}(\\d)_.+", "\\2", myvec))
[1] 0 9 NA
Warning message:
NAs introduced by coercion
I'd like:
[1] 0 9 25
Please, any help with it?
Some explanation: we are interested in the digits that come right after a _. [0-9] matches a digit, and the + says that we want one or more digits in a row. (?<=_) is a lookbehind: it checks that the digits are preceded by a _ without including the _ in the match. Since str_extract returns only the first match, we get the number after the first underscore.
library(stringr)
str_extract(myvec, "(?<=_)[0-9]+")
[1] "0" "9" "25"
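The same lookbehind can be tried without stringr. This is a minimal base R sketch, assuming PCRE is acceptable (perl = TRUE is needed because the default regex engine does not support lookbehind); the variable names m and first_num are illustrative.

```r
myvec <- c("increa_0_1-1", "increa_9_25-112", "increa_25-50-76")

# perl = TRUE switches to PCRE, which supports the (?<=_) lookbehind.
# regexpr() finds only the first match in each string.
m <- regexpr("(?<=_)[0-9]+", myvec, perl = TRUE)
first_num <- as.numeric(regmatches(myvec, m))

print(first_num)
```

Note that regmatches silently drops elements without a match, so this form assumes every string contains an underscore followed by digits.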
Another possible solution, based on stringr::str_extract:
library(stringr)
myvec<-c("increa_0_1-1","increa_9_25-112","increa_25-50-76" )
as.numeric(str_extract(myvec, "(?<=_)\\d+"))
#> [1] 0 9 25
You can use sub (because you will need a single search and replace operation) with a pattern like ^[^_]*_(\d+).*:
myvec<-c("increa_0_1-1","increa_9_25-112","increa_25-50-76" )
sub("^[^_]*_(\\d+).*", "\\1", myvec)
# => [1] "0" "9" "25"
Regex details:
^ - start of string
[^_]* - a negated character class that matches any zero or more chars other than _
_ - a _ char
(\d+) - Group 1 (\1 refers to the value captured into this group from the replacement pattern): one or more digits
.* - the rest of the string (. in TRE regex matches line break chars by default).
If you want to extract the first run of digits that directly follows an underscore, you can use a capture group with str_match and the pattern _([0-9]+).
Note that the character class (or \\d) has to be repeated one or more times with +.
For example:
library(stringr)
myvec<-c("increa_0_1-1","increa_9_25-112","increa_25-50-76" )
str_match(myvec, "_([0-9]+)")[,2]
Output
[1] "0" "9" "25"
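To see where the attempt in the question goes wrong, it may help to look at the strings before as.numeric is applied. The pattern there allows exactly one digit and then requires another _, so the third element does not match at all, comes back unchanged, and turns into NA on conversion. A small base R sketch (regexec is used here in place of str_match only to keep the example free of packages; orig and fixed are illustrative names):

```r
myvec <- c("increa_0_1-1", "increa_9_25-112", "increa_25-50-76")

# Pattern from the question: a single digit followed by a mandatory "_".
orig <- sub("(.*_){1}(\\d)_.+", "\\2", myvec)
print(orig)   # the third string is returned unchanged: no match, no replacement

# Capture-group approach: "_" followed by one or more digits.
groups <- regmatches(myvec, regexec("_([0-9]+)", myvec))
fixed <- vapply(groups, function(g) g[2], character(1))
print(as.numeric(fixed))
```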
myvec<-c("increa_0_1-1","increa_9_25-112","increa_25-50-76" )
as.numeric(gsub("[^_]*_(\\d+).*", "\\1", myvec))
[1] 0 9 25
This must be easy, but I can't handle it. Sorry for that!
I have this string:
string <- c("AB1C1", "AB2C2", "AB3C20")
[1] "AB1C1" "AB2C2" "AB3C20"
I would like to add an underscore before the last letter that is followed by digits at the end of the string.
Desired output:
[1] "AB1_C1" "AB2_C2" "AB3_C20"
I have tried so far:
With the regex [A-Z][0-9]+$ I can match the last letter followed by digits,
but I don't know how to add an underscore before this match.
You can use
sub("(.*)(\\D\\d+)$", "\\1_\\2", string)
## => [1] "AB1_C1" "AB2_C2" "AB3_C20"
sub("(\\D\\d+)$", "_\\1", string)
## => [1] "AB1_C1" "AB2_C2" "AB3_C20"
Details:
(.*) - Group 1: any zero or more chars as many as possible
(\D\d+) - Group 2: any non-digit and then one or more digits
$ - end of string.
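As a quick sanity check, the two sub variants can be compared on a slightly larger input. The sketch below adds two hypothetical strings that do not end in a non-digit plus digits; because of the $ anchor, both patterns should leave them untouched.

```r
# The last two inputs are made up: they do not end in "non-digit + digits".
string <- c("AB1C1", "AB2C2", "AB3C20", "AB1C", "ABC")

with_two_groups <- sub("(.*)(\\D\\d+)$", "\\1_\\2", string)
with_one_group  <- sub("(\\D\\d+)$", "_\\1", string)

print(with_two_groups)
print(identical(with_two_groups, with_one_group))
```

The one-group form is shorter; the two-group form makes it explicit that the prefix is kept as-is.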
I have a lot of strings like this:
2019/01/01/07/556662_cba3a4fc-cb8f-4150-859f-5f21a38373d0
I want to extract the substring that lies right after the last "/" and ends before the "_":
556662
I have found out how to extract: /01/01/07/556662
by using the following regex: (\/)(.*?)(?=\_)
Please advise how I can capture the right group.
You may use
x <- "2019/01/01/07/556662_cba3a4fc-cb8f-4150-859f-5f21a38373d0"
regmatches(x, regexpr(".*/\\K[^_]+", x, perl=TRUE))
## [1] "556662"
Here, the regex matches and returns the first substring that matches:
.*/ - any 0+ chars as many as possible up to the last /
\K - omits this part from the match
[^_]+ - puts 1 or more chars other than _ into the match value.
Or, a sub solution:
sub(".*/([^_]+).*", "\\1", x)
Here, it is similar to the previous one, but the 1 or more chars other than _ are captured into Group 1 (\1 in the replacement pattern) and the trailing .* makes sure the whole input is matched (and consumed, ready to be replaced).
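If the strings are always slash-separated paths, a different technique from the regexes above is to let basename strip the directory part first and then cut at the first underscore. This is only a sketch under that assumption; the second input and the names file_part and id are made up.

```r
x <- c("2019/01/01/07/556662_cba3a4fc-cb8f-4150-859f-5f21a38373d0",
       "2020/12/31/23/100001_abc")   # second element is hypothetical

# basename() keeps only the part after the last "/".
file_part <- basename(x)

# Drop everything from the first "_" onward.
id <- sub("_.*", "", file_part)

print(id)
```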
Alternative non-base R solutions
If you can afford or prefer to work with stringi, you may use
library(stringi)
stri_match_last_regex("2019/01/01/07/556662_cba3a4fc-cb8f-4150-859f-5f21a38373d0", ".*/([^_]+)")[,2]
## [1] "556662"
This will match a string up to the last / and will capture into Group 1 (that you access in Column 2 using [,2]) 1 or more chars other than _.
Or
stri_extract_last_regex("2019/01/01/07/556662_cba3a4fc-cb8f-4150-859f-5f21a38373d0", "(?<=/)[^_/]+")
## => [1] "556662"
This will extract the last match of a string that consists of 1 or more chars other than _ and / after a /.
You could use a capturing group:
/([^_/]+)_[^/\s]*
Explanation
/ Match a / literally
([^_/]+) Capture in Group 1 one or more chars other than an underscore or a forward slash
_[^/\s]* Match _ and then 0+ times any char other than a forward slash or a whitespace character
One option to get the capturing group might be to get the second column using str_match:
library(stringr)
str = c("2019/01/01/07/556662_cba3a4fc-cb8f-4150-859f-5f21a38373d0")
str_match(str, "/([^_/]+)_[^/\\s]*")[,2]
# [1] "556662"
I adapted the regex from Wiktor Stribiżew's code.
x <- "2019/01/01/07/556662_cba3a4fc-cb8f-4150-859f-5f21a38373d0"
regmatches(x, regexpr(".*/([0-9]+)", x, perl=TRUE))
sub(".*/([0-9]+).*", "\\1", x)
Output
[1] "2019/01/01/07/556662"
[1] "556662"
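The first call above returns the whole match because regmatches combined with regexpr knows nothing about capture groups. If the group value is wanted without switching to sub, one option is regexec, which also reports the group positions. A minimal sketch (parts is an illustrative name):

```r
x <- "2019/01/01/07/556662_cba3a4fc-cb8f-4150-859f-5f21a38373d0"

# regexec() records the whole match and each capture group;
# regmatches() then returns them as one character vector per input string.
parts <- regmatches(x, regexec(".*/([0-9]+)", x))[[1]]

print(parts[1])  # whole match
print(parts[2])  # Group 1
```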
I have the following code with a regex
CHARACTER <- "^([A-Z0-9 .])+(?:\\(V\\.O\\.\\))?(?:\\(O\\.S\\.\\))?(?:\\(CONT'D\\))?$"
str_match("WILL (V.O.)",CHARACTER)[1,2]
I thought this should match the value of "WILL " but it is returning blank.
I tried the RegEx in a different language and the group is coming back blank in that instance also.
What do I have to add to this regex to pull back just the value "WILL"?
You created a repeated capturing group by placing the + outside the group, so the group keeps only its last iteration (here, the trailing space). Move the + back inside:
CHARACTER <- "^([A-Z0-9 .]+)(?:\\(V\\.O\\.\\))?(?:\\(O\\.S\\.\\))?(?:\\(CONT'D\\))?$"
#                         ^ the + now sits inside the group
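The effect of a repeated capturing group is easier to see on a deliberately reduced pattern (not the full CHARACTER regex). The sketch below uses base R regexec with perl = TRUE purely to stay package-free; outside and inside are illustrative names.

```r
# + outside the group: the group is overwritten on every repetition,
# so only the last character survives.
outside <- regmatches("WILL", regexec("^([A-Z])+$", "WILL", perl = TRUE))[[1]]

# + inside the group: the group spans the whole run of letters.
inside <- regmatches("WILL", regexec("^([A-Z]+)$", "WILL", perl = TRUE))[[1]]

print(outside[2])
print(inside[2])
```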
Note that you can drop the trailing whitespace from WILL by making the group lazy and adding \s* after it:
CHARACTER <- "^([A-Z0-9\\s.]+?)\\s*(?:\\(V\\.O\\.\\))?(?:\\(O\\.S\\.\\))?(?:\\(CONT'D\\))?$"
> library(stringr)
> CHARACTER <- "^([A-Z0-9\\s.]+?)\\s*(?:\\(V\\.O\\.\\))?(?:\\(O\\.S\\.\\))?(?:\\(CONT'D\\))?$"
> str_match("WILL (V.O.)",CHARACTER)[1,2]
[1] "WILL"
Alternatively, you may just extract WILL with
> str_extract("WILL (V.O.)", "^.*?(?=\\s*(?:\\(V\\.O\\.\\))?(?:\\(O\\.S\\.\\))?(?:\\(CONT'D\\))?$)")
[1] "WILL"
Or the same with base R:
> x <- "WILL (V.O.)"
> regmatches(x, regexpr("^.*?(?=\\s*(?:\\(V\\.O\\.\\))?(?:\\(O\\.S\\.\\))?(?:\\(CONT'D\\))?$)", x, perl=TRUE))
[1] "WILL"
Here,
^ - matches the start of a string
.*? - any 0+ chars other than line break chars as few as possible
(?=\\s*(?:\\(V\\.O\\.\\))?(?:\\(O\\.S\\.\\))?(?:\\(CONT'D\\))?$) - makes sure that, immediately to the right of the current location, there is:
  \\s* - 0+ whitespaces
  (?:\\(V\\.O\\.\\))? - an optional (V.O.) substring
  (?:\\(O\\.S\\.\\))? - an optional (O.S.) substring
  (?:\\(CONT'D\\))? - an optional (CONT'D) substring
  $ - end of string.
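The lookahead pattern can also be applied to a whole vector of cue lines. In the sketch below only the first element comes from the question; the other two cue lines and the names cues, pat, and names_only are assumptions made for the illustration.

```r
# Hypothetical cue lines; only the first one appears in the question.
cues <- c("WILL (V.O.)", "SEAN (CONT'D)", "LAMBEAU")

pat <- "^.*?(?=\\s*(?:\\(V\\.O\\.\\))?(?:\\(O\\.S\\.\\))?(?:\\(CONT'D\\))?$)"

# perl = TRUE is required for the lazy quantifier and the lookahead.
names_only <- regmatches(cues, regexpr(pat, cues, perl = TRUE))

print(names_only)
```

The pattern expects the optional suffixes in this fixed order with no space between them, so a line with a differently ordered or space-separated combination would need an adjusted pattern.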
I need to capture TEST_WF1_CORP[-application-com.ibm.ws.runtime.WsServer] from the following string, that is, everything between the first - and the # sign.
i<-c("Current CPU load - TEST_WF1_CORP[-application-com.ibm.ws.runtime.WsServer]#example1.com")
I've tried this:
str_match(i, ".*-([^\\.]*)\\#.*")[,2]
I am getting NA, any ideas?
1) gsub Replace everything up to and including the space-hyphen-space separator (the pattern ".* - "), and everything from # onward (the pattern "#.*"), with a zero-length string. No packages are needed:
gsub(".* - |#.*", "", i)
## [1] "TEST_WF1_CORP[-application-com.ibm.ws.runtime.WsServer]"
2) sub This would also work. It matches everything up to the minus and the space after it (i.e. ".*- "), then captures everything up to the last # (i.e. "(.*)#"), followed by whatever is left (.*), and replaces the whole match with the capture group, i.e. the part within parens. It also uses no packages.
sub(".*- (.*)#.*", "\\1", i)
## [1] "TEST_WF1_CORP[-application-com.ibm.ws.runtime.WsServer]"
Note: We used this as input i:
i <- "Current CPU load - TEST_WF1_CORP[-application-com.ibm.ws.runtime.WsServer]#example1.com"
The following should work:
extract <- unlist(strsplit(i,"- |#"))[2]
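The unlist(...)[2] form only works for a single string. For a vector of such lines, one possible variant picks the second piece of each split result instead. This is a sketch; the second input string and the names v and pieces are made up.

```r
# The second element is a hypothetical string in the same layout.
v <- c("Current CPU load - TEST_WF1_CORP[-application-com.ibm.ws.runtime.WsServer]#example1.com",
       "Current CPU load - TEST_WF2_CORP[-application]#example2.com")

# strsplit() returns one character vector of pieces per input string.
pieces <- strsplit(v, "- |#")

# Take the second piece of every element.
extract <- vapply(pieces, function(p) p[2], character(1))

print(extract)
```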
You may use
-\s*([^#]+)
Details:
- - a hyphen
\s* - zero or more whitespaces
([^#]+) - Group 1 capturing 1 or more chars other than #.
R demo:
> library(stringr)
> i<-c("Current CPU load - TEST_WF1_CORP[-application-com.ibm.ws.runtime.WsServer]#example1.com")
> str_match(i, "-\\s*([^#]+)")[,2]
[1] "TEST_WF1_CORP[-application-com.ibm.ws.runtime.WsServer]"
The same pattern can be used with base R regmatches/regexec:
> regmatches(i, regexec("-\\s*([^#]+)", i))[[1]][2]
[1] "TEST_WF1_CORP[-application-com.ibm.ws.runtime.WsServer]"
If you prefer a replacing approach you may use a sub:
> sub(".*?-\\s*([^#]+).*", "\\1", i)
[1] "TEST_WF1_CORP[-application-com.ibm.ws.runtime.WsServer]"
Here, .*? matches any 0+ chars, as few as possible, up to the first -, then -, 0+ whitespaces (\\s*), then 1+ chars other than # are captured into Group 1 (see ([^#]+)) and then .* matches the rest of the string. The \1 in the replacement pattern puts the contents of Group 1 back into the replacement result.
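As for why the pattern in the question returns NA: its capture group [^\\.]* cannot cross a dot, yet every hyphen in the input is separated from the # by at least one dot, so the overall match fails. A small base R check (perl = TRUE is an assumption made here so the escapes from the question are accepted as written; stringr itself uses the ICU engine):

```r
i <- "Current CPU load - TEST_WF1_CORP[-application-com.ibm.ws.runtime.WsServer]#example1.com"

# Pattern from the question: the group excludes dots, so it can never reach "#".
matched_orig <- grepl(".*-([^\\.]*)\\#.*", i, perl = TRUE)

# Excluding "#" instead of "." lets the group run up to the "#".
matched_fixed <- grepl(".*?-\\s*([^#]*)#.*", i, perl = TRUE)

print(c(matched_orig, matched_fixed))
```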