Opposite of Hmisc::escapeRegex - r

The function Hmisc::escapeRegex escapes any special characters in a string.
library(Hmisc)
string <- "this\\(system) {is} [full]."
escapeRegex(string)
It is based on gsub and a regular expression:
escapestring <- gsub("([.|()\\^{}+$*?]|\\[|\\])", "\\\\\\1", string)
escapestring
[1] "this\\\\\\(system\\) \\{is\\} \\[full\\]\\."
How to remove the backslashes from escapestring so that the original string is retrieved?

You actually only need to keep the character after each \ to un-escape.
string <- "this\\(system) {is} [full]."
library(Hmisc)
gsub("\\\\(.)", "\\1", escapeRegex(string))
#> [1] "this\\(system) {is} [full]."
Alternatively rex may make both escaping and un-escaping a little simpler.
library(rex)
re_substitutes(escape(string), rex("\\", capture(any)), "\\1", global = TRUE)
#> [1] "this\\(system) {is} [full]."
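As a quick sanity check on the first approach, here is a minimal base R sketch that round-trips a string through escaping and un-escaping. The helper names escape_regex and unescape_regex are hypothetical (they are not part of Hmisc); escape_regex simply reuses the gsub call from the question. It assumes every backslash-plus-character pair in the input was produced by escaping.

```r
# Hypothetical helpers for illustration; not part of Hmisc.
escape_regex <- function(x) {
  # Same pattern as in the question: prefix each special character with a backslash.
  gsub("([.|()\\^{}+$*?]|\\[|\\])", "\\\\\\1", x)
}
unescape_regex <- function(x) {
  # Drop each escaping backslash and keep the character that follows it.
  gsub("\\\\(.)", "\\1", x)
}

string   <- "this\\(system) {is} [full]."
escaped  <- escape_regex(string)
restored <- unescape_regex(escaped)

identical(restored, string)
#> [1] TRUE
```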

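How about this regex, which matches a backslash followed by one of the special characters:
\\\\([.|()\\^{}+$*?]|\\[|\\])
and replaces each match with capture group \1 (the character itself)?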
Example usage
escapestring <- "this\\\\\\(system\\) \\{is\\} \\[full\\]\\."
string <- gsub("\\\\([.|()\\^{}+$*?]|\\[|\\])", "\\1", escapestring)
string
[1] "this\\(system) {is} [full]."

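Maybe this also helps. It removes every backslash except one that directly precedes (, which the PCRE verbs (*SKIP)(*F) skip over. That is enough for this string, where the only original backslash sits before (: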
gsub("\\\\[(](*SKIP)(*F)|\\\\", '', escapestring, perl=TRUE)
#[1] "this\\(system) {is} [full]."

Related

Regex to add comma between any character

I'm relatively new to regex, so bear with me if the question is trivial. I'd like to place a comma between every letter of a string using regex, e.g.:
x <- "ABCD"
I want to get
"A,B,C,D"
It would be nice if I could do that using gsub, sub or related on a vector of strings of arbitrary number of characters.
I tried
> sub("(\\w)", "\\1,", x)
[1] "A,BCD"
> gsub("(\\w)", "\\1,", x)
[1] "A,B,C,D,"
> gsub("(\\w)(\\w{1})$", "\\1,\\2", x)
[1] "ABC,D"
Try:
x <- 'ABCD'
gsub('\\B', ',', x, perl = TRUE)
Prints:
[1] "A,B,C,D"
I might have misread the query; the OP is looking to add commas between letters only. Therefore try:
gsub('(\\p{L})(?=\\p{L})', '\\1,', x, perl = TRUE)
(\p{L}) - Match any kind of letter from any language in a 1st group;
(?=\p{L}) - Positive lookahead to match as per above.
We can use the backreference to this capture group in the replacement.
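To see what "between letters only" changes in practice, the sketch below compares the letter-only pattern with a pattern that accepts any character. The second input string is a made-up example that contains a hyphen.

```r
x <- c("ABCD", "AB-CD")  # "AB-CD" is a made-up example with a non-letter

# Comma only between two adjacent letters.
letters_only <- gsub("(\\p{L})(?=\\p{L})", "\\1,", x, perl = TRUE)
letters_only
#> [1] "A,B,C,D" "A,B-C,D"

# Comma after every character that has a successor, letter or not.
any_char <- gsub("(.)(?=.)", "\\1,", x, perl = TRUE)
any_char
#> [1] "A,B,C,D"   "A,B,-,C,D"
```

Which of the two is preferable depends on whether punctuation and digits should be separated as well.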
You can use
> gsub("(.)(?=.)", "\\1,", x, perl=TRUE)
[1] "A,B,C,D"
The (.)(?=.) regex matches any char, capturing it into Group 1 (with (.)), that must be followed by any single char ((?=.) is a positive lookahead that requires a char immediately to the right of the current location).
Variations of the solution:
> gsub("(.)(?!$)", "\\1,", x, perl=TRUE)
## Or with stringr:
## stringr::str_replace_all(x, "(.)(?!$)", "\\1,")
[1] "A,B,C,D"
Here, (?!$) fails the match if the current position is the end of the string.
See the R demo:
x <- "ABCD"
gsub("(.)(?=.)", "\\1,", x, perl=TRUE)
# => [1] "A,B,C,D"
gsub("(.)(?!$)", "\\1,", x, perl=TRUE)
# => [1] "A,B,C,D"
stringr::str_replace_all(x, "(.)(?!$)", "\\1,")
# => [1] "A,B,C,D"
A non-regex answer:
paste(strsplit(x, "")[[1]], collapse = ",")
#[1] "A,B,C,D"
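The question asks for a vector of strings, while the line above handles a single string via [[1]]. A possible vectorised variant, sketched with base R only (the extra input strings are made up for illustration):

```r
x <- c("ABCD", "XY", "Z")  # made-up inputs of different lengths

# strsplit() returns one character vector per input string;
# vapply() pastes each of them back together with commas.
with_commas <- vapply(strsplit(x, ""), paste, character(1), collapse = ",")
with_commas
#> [1] "A,B,C,D" "X,Y"     "Z"
```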
Another option is to use a positive lookbehind and lookahead to assert that there is a preceding and a following character:
library(stringr)
str_replace_all(x, "(?<=.)(?=.)", ",")
[1] "A,B,C,D"

How to extract text between two separators in R?

I have a vector of strings like so:
mystr <- c("./10g/13.9264.csv", "./6g/62.0544.csv")
I only want the part between the two forward slashes, i.e., "10g" and "6g".
You could sub() here with a capture group:
mystr <- c("./10g/13.9264.csv", "./6g/62.0544.csv")
sub(".*/([^/]+)/.*", "\\1", mystr)
[1] "10g" "6g"
Similar to Tim Biegeleisen's answer, but with a lookbehind and a lookahead, using str_extract from stringr:
library(stringr)
mystr <- c("./10g/13.9264.csv", "./6g/62.0544.csv")
str_extract(mystr,"(?<=/)[^/]+(?=/)")
[1] "10g" "6g"
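The two regex answers agree on the question's input but can differ once a path has more than one directory level. The sketch below uses hypothetical nested paths, and base R's regmatches/regexpr as a stand-in for str_extract (both return the first match) so that no package is needed.

```r
nested <- c("./run1/10g/13.9264.csv", "./run2/6g/62.0544.csv")  # hypothetical paths

# Greedy ".*" consumes as much as possible, so the capture group
# lands on the last directory before the file name.
last_dir <- sub(".*/([^/]+)/.*", "\\1", nested)
last_dir
#> [1] "10g" "6g"

# The lookaround pattern returns the first match, i.e. the first directory.
first_dir <- regmatches(nested, regexpr("(?<=/)[^/]+(?=/)", nested, perl = TRUE))
first_dir
#> [1] "run1" "run2"
```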
More simply you can capitalize on the fact that the desired substring is one or more digits followed by literal g:
library(stringr)
str_extract(mystr, "\\d+g")
[1] "10g" "6g"
Here are a few alternatives. They use no packages and the first two do not use any regular expressions.
basename(dirname(mystr))
## [1] "10g" "6g"
read.table(text = mystr, sep = "/")[[2]]
## [1] "10g" "6g"
trimws(trimws(mystr,, "[^/]"),, "/")
## [1] "10g" "6g"
We could also reformulate these using pipes (the native pipe |> requires R 4.1 or later)
mystr |> dirname() |> basename()
## [1] "10g" "6g"
read.table(text = mystr, sep = "/") |> (`[[`)(2)
## [1] "10g" "6g"
mystr |> trimws(, "[^/]") |> trimws(, "/")
## [1] "10g" "6g"
Note
From the question the input is
mystr <- c("./10g/13.9264.csv", "./6g/62.0544.csv")

How to apply the function substr to each element of a string

I have this string of words
string<-c("chair-desk-tree-table-computer-mousse")
I want to retrieve the first three characters of each word and store them in an object like that:
newstring==> [1] "cha-des-tre-tab-com-mou"
> newstring <- substring( strsplit(string, "-")[[1]], 1, 3)
> newstring <- paste0(newstring, collapse = "-")
> newstring
[1] "cha-des-tre-tab-com-mou"
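The split, truncate, and paste steps above can be wrapped into a small function that also works on a vector of strings. This is only a sketch; the name shorten_words and its arguments n and sep are hypothetical.

```r
# Hypothetical helper: keep the first `n` characters of each `sep`-separated word.
shorten_words <- function(x, n = 3, sep = "-") {
  words <- strsplit(x, sep, fixed = TRUE)  # one character vector per input string
  vapply(words,
         function(w) paste(substring(w, 1, n), collapse = sep),
         character(1))
}

shortened <- shorten_words(c("chair-desk-tree-table-computer-mousse", "K29-E665-I1190"))
shortened
#> [1] "cha-des-tre-tab-com-mou" "K29-E66-I11"
```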
Using gsub with a regex lookbehind to match (and remove) the one or more lower case letters that follow the first 3 lower case letters of each word
gsub("(?<=\\b[a-z]{3})[a-z]+", "", string, perl = TRUE)
[1] "cha-des-tre-tab-com-mou"
Using the edited string
> string <- c(string, "K29-E665-I1190")
> gsub("(?<=\\b[[:alnum:]]{3})[[:alnum:]]+", "", string, perl = TRUE)
[1] "cha-des-tre-tab-com-mou" "K29-E66-I11"
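If lookbehinds feel opaque, a capture group can express the same idea: capture the first three characters of each word, match the rest, and put back only the captured part. A minimal sketch (the third input string is a made-up case with a word shorter than three characters, which is left unchanged):

```r
string <- c("chair-desk-tree-table-computer-mousse", "K29-E665-I1190", "ab-cdefg")

# \\b anchors at the start of a word; group 1 holds its first three characters,
# and the remaining alphanumerics are matched but not kept.
truncated <- gsub("\\b([[:alnum:]]{3})[[:alnum:]]*", "\\1", string, perl = TRUE)
truncated
#> [1] "cha-des-tre-tab-com-mou" "K29-E66-I11"             "ab-cde"
```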

How to extract string after 2nd delimiter in R

I have my vector as
dt <- c("1:7984985:A:G", "1:7984985-7984985:A:G", "1:7984985-7984985:T:G")
I would like to extract everything after 2nd :.
The result I would like is
A:G , A:G, T:G
What would be the solution for this?
We can use sub to match two instances of one or more characters that are not a : ([^:]+) followed by : from the start (^) of the string and replace it with blank ("")
sub("^([^:]+:){2}", "", dt)
#[1] "A:G" "A:G" "T:G"
It can be also done with trimws (if it is not based on position)
trimws(dt, whitespace = "[-0-9:]")
#[1] "A:G" "A:G" "T:G"
Or using str_remove from stringr
library(stringr)
str_remove(dt, "^([^:]+:){2}")
#[1] "A:G" "A:G" "T:G"
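The pattern used with sub and str_remove above generalises to any delimiter count by changing the number in braces. A rough sketch, where after_nth is a hypothetical helper and sep is assumed to be a single character with no special meaning inside a bracket expression:

```r
# Hypothetical helper: drop everything up to and including the n-th separator.
# Assumes `sep` is a single character that is not special inside [...].
after_nth <- function(x, n, sep = ":") {
  pattern <- sprintf("^([^%s]+%s){%d}", sep, sep, n)
  sub(pattern, "", x)
}

dt <- c("1:7984985:A:G", "1:7984985-7984985:A:G", "1:7984985-7984985:T:G")
after_nth(dt, 2)
#> [1] "A:G" "A:G" "T:G"
after_nth(dt, 1)
#> [1] "7984985:A:G"         "7984985-7984985:A:G" "7984985-7984985:T:G"
```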
You can use sub, capture the items you want to retain in a capturing group (...) and refer back to them in the replacement argument to sub:
sub("^.:[^:]+:(.:.)", "\\1", dt, perl = TRUE)
[1] "A:G" "A:G" "T:G"
Alternatively, you can use str_extract and positive lookbehind (?<=...):
library(stringr)
str_extract(dt, "(?<=:)[A-Z]:[A-Z]")
[1] "A:G" "A:G" "T:G"
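Or simply use str_split, which returns a list. With n = 3 each string is split into at most three pieces, and the third piece is everything after the 2nd colon:
str_split("1:7984985:A:G", ":", n = 3)[[1]][3]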
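The same splitting idea can be sketched in base R for the whole vector, without stringr: split on the literal colon, drop the first two pieces, and rejoin whatever is left.

```r
dt <- c("1:7984985:A:G", "1:7984985-7984985:A:G", "1:7984985-7984985:T:G")

# Split every string on ":", drop the first two pieces, and rejoin the rest.
pieces <- strsplit(dt, ":", fixed = TRUE)
tail_part <- vapply(pieces,
                    function(p) paste(p[-(1:2)], collapse = ":"),
                    character(1))
tail_part
#> [1] "A:G" "A:G" "T:G"
```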

Use gsub to keep only the first part of my string

In R, I have strings looking like:
test <- 'ZYG11B|79699'
I want to keep only 'ZYG11B'.
My best attempt yet:
gsub ("|.*$", "", test) # should replace everything after '|' by nothing
but returns
> [1] ""
How should I do that?
| is a regex metacharacter (it means alternation), so to match a literal | it must be enclosed in square brackets or escaped with a double backslash:
> gsub('[|].*$','', test)
[1] "ZYG11B"
> gsub('\\|.*$','', test)
[1] "ZYG11B"
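Another way to sidestep the escaping problem is to switch off regex interpretation with fixed = TRUE. The sketch below assumes every string contains a |; the second input value is made up for illustration.

```r
test <- c("ZYG11B|79699", "ABC|1")  # second value is a made-up example

# fixed = TRUE treats "|" as a literal character, so no escaping is needed.
pipe_pos   <- regexpr("|", test, fixed = TRUE)
first_part <- substr(test, 1, pipe_pos - 1)
first_part
#> [1] "ZYG11B" "ABC"

# Same idea with strsplit(): keep the first piece of each string.
first_piece <- vapply(strsplit(test, "|", fixed = TRUE), `[`, character(1), 1)
first_piece
#> [1] "ZYG11B" "ABC"
```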
We can do
library(stringr)
str_extract(test, "\\w+")
#[1] "ZYG11B"
