regex for replacement of specific character outside parenthesis only - r

I am looking for a regex (preferably in R) that replaces every occurrence of a specific character, say ;, with another string, say ;;, but only where it is not inside parentheses () in the text string.
Note: 1. There may be more than one such character inside the parentheses too
2. There are no nested parentheses in the data/vector
Example
text;othertext to be replaced with text;;othertext
but text;other(texttt;some;someother);more to be replaced with text;;other(texttt;some;someother);;more (i.e. only the ; outside () is replaced with the replacement text)
If any further clarification is needed, I will try to explain.
in_vec <- c("abcd;ghi;dfsF(adffg;adfsasdf);dfg;(asd;fdsg);ag", "zvc;dfasdf;asdga;asd(asd;hsfd)", "adsg;(asdg;ASF;DFG;ASDF;);sdafdf", "asagf;(fafgf;sadg;sdag;a;gddfg;fd)gsfg;sdfa")
in_vec
#> [1] "abcd;ghi;dfsF(adffg;adfsasdf);dfg;(asd;fdsg);ag"
#> [2] "zvc;dfasdf;asdga;asd(asd;hsfd)"
#> [3] "adsg;(asdg;ASF;DFG;ASDF;);sdafdf"
#> [4] "asagf;(fafgf;sadg;sdag;a;gddfg;fd)gsfg;sdfa"
Expected output (calculated manually)
[1] "abcd;;ghi;;dfsF(adffg;adfsasdf);;dfg;;(asd;fdsg);;ag"
[2] "zvc;;dfasdf;;asdga;;asd(asd;hsfd)"
[3] "adsg;;(asdg;ASF;DFG;ASDF;);;sdafdf"
[4] "asagf;;(fafgf;sadg;sdag;a;gddfg;fd)gsfg;;sdfa"

You can use gsub with ;(?![^(]*\\)):
gsub(";(?![^(]*\\))", ";;", in_vec, perl=TRUE)
#[1] "abcd;;ghi;;dfsF(adffg;adfsasdf);;dfg;;(asd;fdsg);;ag"
#[2] "zvc;;dfasdf;;asdga;;asd(asd;hsfd)"
#[3] "adsg;;(asdg;ASF;DFG;ASDF;);;sdafdf"
#[4] "asagf;;(fafgf;sadg;sdag;a;gddfg;fd)gsfg;;sdfa"
; finds ;, (?!) .. Negative Lookahead (make the replacement when it does not match), [^(] .. everything but not (, * repeat the previous 0 to n times, \\) .. followed by ).
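As a quick check of the negative-lookahead pattern, here is a minimal sketch applied to the single example string from the question. The variable names `s` and `res_neg` are only for illustration.

```r
# Example string taken from the question
s <- "text;other(texttt;some;someother);more"

# A ; is doubled only when it is NOT followed by "non-( characters, then )"
res_neg <- gsub(";(?![^(]*\\))", ";;", s, perl = TRUE)
res_neg
#> [1] "text;;other(texttt;some;someother);;more"
```

The two semicolons inside the parentheses see a `)` before any `(`, so the lookahead matches and they are left alone.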
Or
gsub(";(?=[^)]*($|\\())", ";;", in_vec, perl=TRUE)
#[1] "abcd;;ghi;;dfsF(adffg;adfsasdf);;dfg;;(asd;fdsg);;ag"
#[2] "zvc;;dfasdf;;asdga;;asd(asd;hsfd)"
#[3] "adsg;;(asdg;ASF;DFG;ASDF;);;sdafdf"
#[4] "asagf;;(fafgf;sadg;sdag;a;gddfg;fd)gsfg;;sdfa"
; finds ;, (?=) .. Positive Lookahead (make the replacement when it does match), [^)] .. everything but not ), * repeat the previous 0 to n times, ($|\\() .. match end $ or (.
Or using gregexpr and regmatches extracting the part between ( and ) and making the replacement in the non-matched substrings:
x <- gregexpr("\\(.*?\\)", in_vec) #Find the part between ( and )
mapply(function(a, b) {
paste(matrix(c(gsub(";", ";;", b), a, ""), 2, byrow=TRUE), collapse = "")
}, regmatches(in_vec, x), regmatches(in_vec, x, TRUE))
#[1] "abcd;;ghi;;dfsF(adffg;adfsasdf);;dfg;;(asd;fdsg);;ag"
#[2] "zvc;;dfasdf;;asdga;;asd(asd;hsfd)"
#[3] "adsg;;(asdg;ASF;DFG;ASDF;);;sdafdf"
#[4] "asagf;;(fafgf;sadg;sdag;a;gddfg;fd)gsfg;;sdfa"
But all of them work only for simple, non-nested pairs of ( and ).
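To illustrate that limitation, here is a small sketch with a nested input. The string `nested` is a made-up example, not part of the question's data.

```r
# Hypothetical input with one level of nesting
nested <- "a;(b;(c;d);e);f"

# Same negative-lookahead pattern as above
res_nested <- gsub(";(?![^(]*\\))", ";;", nested, perl = TRUE)
res_nested
#> [1] "a;;(b;;(c;d);e);;f"
```

The `;` after `b` sits inside the outer parentheses, yet it is doubled, because the next bracket it sees is an opening `(` rather than a closing `)`.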

Though the problem can be tackled with regex, using a simple function might be more straightforward and easier to understand.
replace_semicolons_outside_parentheses <- function(raw_string) {
  # Replace ; with ;; outside of parentheses
  processed_string <- ""
  n_open_parentheses <- 0
  # Loop over the characters in raw_string
  for (char in strsplit(raw_string, "")[[1]]) {
    # Update the net number of open parentheses
    if (char == "(") {
      n_open_parentheses <- n_open_parentheses + 1
    } else if (char == ")") {
      n_open_parentheses <- n_open_parentheses - 1
    }
    # Replace ; with ;; outside of parentheses
    if (char == ";" && n_open_parentheses == 0) {
      processed_string <- paste0(processed_string, ";;")
    } else {
      processed_string <- paste0(processed_string, char)
    }
  }
  return(processed_string)
}
Note that the function above also works for nested parentheses: no semicolons inside nested parentheses are replaced! The desired output can be obtained in a single line:
out_vec <- lapply(in_vec, replace_semicolons_outside_parentheses)
# 1. 'abcd;;ghi;;dfsF(adffg;adfsasdf);;dfg;;(asd;fdsg);;ag'
# 2. 'zvc;;dfasdf;;asdga;;asd(asd;hsfd)'
# 3. 'adsg;;(asdg;ASF;DFG;ASDF;);;sdafdf'
# 4. 'asagf;;(fafgf;sadg;sdag;a;gddfg;fd)gsfg;;sdfa'
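The same depth-counting idea can be sketched without an explicit loop. The helper below, `replace_outside_parens`, is a hypothetical name and not part of the answer above; it assumes balanced parentheses and returns a plain character vector rather than a list.

```r
# Hypothetical helper: depth counting via cumsum, vectorised with vapply
replace_outside_parens <- function(x, from = ";", to = ";;") {
  vapply(x, function(s) {
    chars <- strsplit(s, "", fixed = TRUE)[[1]]
    # Nesting depth after each character: +1 for "(", -1 for ")"
    depth <- cumsum((chars == "(") - (chars == ")"))
    # Replace the target character only where the depth is zero
    chars[chars == from & depth == 0] <- to
    paste(chars, collapse = "")
  }, character(1), USE.NAMES = FALSE)
}

res_depth <- replace_outside_parens(c("a;(b;(c;d);e);f", "text;othertext"))
res_depth
#> [1] "a;;(b;(c;d);e);;f" "text;;othertext"
```

Because the depth is tracked rather than pattern-matched, the nested case is handled the same way as the flat one.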

Use the following in case of no nested parentheses:
gsub("\\([^()]*\\)(*SKIP)(*FAIL)|;", ";;", in_vec, perl=TRUE)
See regex proof.
EXPLANATION
--------------------------------------------------------------------------------
\( '(' char
--------------------------------------------------------------------------------
[^()]* any character except: '(', ')' (0 or more times (matching the most amount possible))
--------------------------------------------------------------------------------
\) ')' char
--------------------------------------------------------------------------------
(*SKIP)(*FAIL) skip current match, search for new one from here
--------------------------------------------------------------------------------
| OR
--------------------------------------------------------------------------------
; ';'
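A minimal sketch of the (*SKIP)(*FAIL) pattern on the question's single example string follows. The names `s` and `res_skip` are illustrative only.

```r
# Example string taken from the question
s <- "text;other(texttt;some;someother);more"

# Whole "(...)" groups are matched first and then discarded by (*SKIP)(*FAIL),
# so only the ; alternative outside parentheses can produce a match
res_skip <- gsub("\\([^()]*\\)(*SKIP)(*FAIL)|;", ";;", s, perl = TRUE)
res_skip
#> [1] "text;;other(texttt;some;someother);;more"
```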
If there are nested parentheses:
gsub("(\\((?:[^()]++|(?1))*\\))(*SKIP)(*FAIL)|;", ";;", in_vec, perl=TRUE)
See regex proof.
EXPLANATION
--------------------------------------------------------------------------------
( group and capture to \1:
--------------------------------------------------------------------------------
\( '('
--------------------------------------------------------------------------------
(?: group, but do not capture:
--------------------------------------------------------------------------------
[^()]++ any character except: '(', ')' (1 or more times (matching the most amount possible, no backtracking))
--------------------------------------------------------------------------------
| or
--------------------------------------------------------------------------------
(?1) recursing first group pattern
--------------------------------------------------------------------------------
) end of grouping
--------------------------------------------------------------------------------
\) ')'
--------------------------------------------------------------------------------
) end of \1
--------------------------------------------------------------------------------
(*SKIP)(*FAIL) skip the match, search for next
--------------------------------------------------------------------------------
| or
--------------------------------------------------------------------------------
; ';'
--------------------------------------------------------------------------------
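As a sketch of the recursive variant, the snippet below applies it to a made-up nested string. `nested` and `res_rec` are illustrative names, and balanced parentheses are assumed.

```r
# Hypothetical input with nesting
nested <- "a;(b;(c;d);e);f"

# (?1) recurses into group 1, so the whole balanced "(b;(c;d);e)" is skipped
res_rec <- gsub("(\\((?:[^()]++|(?1))*\\))(*SKIP)(*FAIL)|;", ";;", nested, perl = TRUE)
res_rec
#> [1] "a;;(b;(c;d);e);;f"
```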

Related

R: add a character to a specific spot in string, trouble with regex syntax

I have a list of string like so:
batch1, batch2, batch3, batch10, batch11
I am trying to add a 0 before the single digits: batch01, batch02, batch03, batch10, batch11
I have found many similar questions and tried to write my own regex. I am very close, but I can't quite make it do what I want.
Batch <- gsub('(.{5})([0-9]{1}\\b)','\\10\\2', Batch)
outputs batch01, batch02, batch03, batch100, batch110
Using \\s instead of \\b doesn't change any values
sampleNames$Batch <- gsub('(.{5})([0-9]{1})','\\10\\2', sampleNames$Batch) outputs batch01, batch02, batch03, batch010, batch011
I've played around with a few other versions but I cannot seem to get it correct. I know this is a somewhat repetitive question, but I have not been able to alter previous solutions to do what I need to do.
We can capture the last digit and the lower case letter before it as two groups, then in the replacement specify the backreference of the groups and the 0 in between. Thus, it won't match the ones having two digits at the end of the string
sub("([a-z])(\\d)$", "\\10\\2", Batch)
[1] "batch01" "batch02" "batch03" "batch10" "batch11"
Or we may use sprintf/str_pad with str_replace
library(stringr)
str_replace(Batch, "\\d+$", function(x) sprintf("%02d", as.numeric(x)))
[1] "batch01" "batch02" "batch03" "batch10" "batch11"
data
Batch <- c("batch1", "batch2", "batch3", "batch10", "batch11")
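The sprintf idea can also be sketched in base R alone, without stringr, by using regmatches() with its replacement form. This is an alternative to the answer's code, and it assumes every element ends in at least one digit; `padded` and `m` are illustrative names.

```r
Batch <- c("batch1", "batch2", "batch3", "batch10", "batch11")

padded <- Batch
# Locate the trailing run of digits in each element
m <- regexpr("[0-9]+$", padded)
# Overwrite each run with its zero-padded, two-digit form
regmatches(padded, m) <- sprintf("%02d", as.integer(regmatches(padded, m)))
padded
#> [1] "batch01" "batch02" "batch03" "batch10" "batch11"
```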
Use
sampleNames$Batch <- sub("(\\D|^)(\\d)$", "\\10\\2", sampleNames$Batch, perl=TRUE)
See regex proof.
EXPLANATION
--------------------------------------------------------------------------------
( group and capture to \1:
--------------------------------------------------------------------------------
\D non-digits (all but 0-9)
--------------------------------------------------------------------------------
| OR
--------------------------------------------------------------------------------
^ the beginning of the string
--------------------------------------------------------------------------------
) end of \1
--------------------------------------------------------------------------------
( group and capture to \2:
--------------------------------------------------------------------------------
\d digits (0-9)
--------------------------------------------------------------------------------
) end of \2
--------------------------------------------------------------------------------
$ before an optional \n, and the end of the string
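To show what the ^ alternative in the first group is for, here is a small sketch on made-up values, including a string that consists of a single digit. `ids` and `res_ids` are hypothetical names.

```r
# Hypothetical inputs: a bare digit, a single-digit suffix, a two-digit suffix
ids <- c("7", "batch7", "batch12")

# Group 1 is either a non-digit or the (empty) start of the string
res_ids <- sub("(\\D|^)(\\d)$", "\\10\\2", ids, perl = TRUE)
res_ids
#> [1] "07"      "batch07" "batch12"
```

Without the ^ branch, the bare "7" would have no preceding non-digit to capture and would be left unpadded.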
You can also use the following solution:
sapply(Batch, function(x) {
d <- gsub("([[:alpha:]]+)(\\d)", "\\2", x)
if(nchar(d) == 1) {
gsub("([[:alpha:]]+)(\\d)", "\\10\\2", x)
} else {
x
}
})
batch1 batch2 batch3 batch10 batch11
"batch01" "batch02" "batch03" "batch10" "batch11"

Match only parenthesis with text and numbers in R

I would like to replace the parentheses and the text between them in string variables. However, I only want to replace those parentheses that contain at least one number.
Example string:
text <- c("Sekretär (dipl.) (G3)", "Zolldeklarant (3 Jahre)", "Grenzwächter (< 2 Jahre)")
I tried the following:
str_extract_all(text, " *\\(.*?\\d+.*?\\) *")
It does extract the text in parenthesis, but in the first one, it matches also the first parenthesis without any number.
The extraction should look like:
" (G3)"
" (3 Jahre)"
" (< 2 Jahre)"
If you want to replace these terms in parentheses, containing at least one number, then gsub is a good base R option:
text
sapply(text, function (x) {
gsub("\\([^()]*\\d[^()]*\\)", "REMOVED", x)
})
[1] "Sekretär (dipl.) (G3)" "Zolldeklarant (3 Jahre)" "Grenzwächter (< 2 Jahre)"
[1] "Sekretär (dipl.) REMOVED" "Zolldeklarant REMOVED" "Grenzwächter REMOVED"
I have replaced with the literal text REMOVED just as a placeholder to show the replacement.
Edit:
If you just want to extract these terms, we can also use gsub for this:
sapply(text, function (x) {
gsub(".*(\\([^()]*\\d[^()]*\\)).*", "\\1", x)
})
[1] "(G3)" "(3 Jahre)" "(< 2 Jahre)"
Here, we capture the term in parentheses, then replace the entire string with just the first (and only) capture group \\1.
You can use
\([^()]*\d+[^()]*\)
See a demo on regex101.com.
Backslashes need to be double escaped in R, so your expression would become
\\([^()]*\\d+[^()]*\\)
Broken down this is
\( # (
[^()]* # not ( nor ), 0+ times
\d+ # digits, 1+
[^()]* # same as above
\) # )
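As a sketch of how this pattern could be used for extraction in base R, the snippet below combines it with gregexpr() and regmatches(). That pairing is an assumption of this example, not something the answer above prescribes, and `hits` is an illustrative name.

```r
text <- c("Sekretär (dipl.) (G3)", "Zolldeklarant (3 Jahre)", "Grenzwächter (< 2 Jahre)")

# Collect every parenthesised group that contains at least one digit
hits <- unlist(regmatches(text, gregexpr("\\([^()]*\\d+[^()]*\\)", text)))
hits
```

Because `[^()]*` cannot cross a parenthesis, "(dipl.)" is never merged with "(G3)" and is skipped for lack of a digit.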
text <- c("Sekretär (dipl.) (G3)", "Zolldeklarant (3 Jahre)", "Grenzwächter (< 2 Jahre)")
gsub(".*\\((.*[0-9].*)\\).*","(\\1)",text)
Basically, you ask gsub to match the whole string but to capture as a group (\1) the text inside parentheses that includes a number.
Also, if you always want to extract the last set of parentheses, that would call for a different approach.

Replace a specific character only between parenthesis

Let's say I have a string:
test <- "(pop+corn)-bread+salt"
I want to replace the plus sign with '|' only where it is between parentheses, so I get:
"(pop|corn)-bread+salt"
I tried:
gsub("([+])","\\|",test)
But it replaces all the plus signs of the string (obviously)
If you want to replace all + symbols that are inside parentheses (if there may be 1 or more), you can use any of the following solutions:
gsub("\\+(?=[^()]*\\))", "|", x, perl=TRUE)
See the regex demo. Here, the + is only matched when it is followed with any 0+ chars other than ( and ) (with [^()]*) and then a ). It is only good if the input is well-formed and there is no nested parentheses as it does not check if there was a starting (.
gsub("(?:\\G(?!^)|\\()[^()]*?\\K\\+", "|", x, perl=TRUE)
This is a safer solution since it starts matching + only if there was a starting (. See the regex demo. In this pattern, (?:\G(?!^)|\() matches the end of the previous match (\G(?!^)) or (|) a (, then [^()]*? matches any 0+ chars other than ( and ) chars, and then \K discards all the matched text and \+ matches a + that will be consumed and replaced. It still does not handle nested parentheses.
Also, see an online R demo for the above two solutions.
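The difference between the two patterns can be sketched as follows. The malformed string `bad`, which lacks its opening parenthesis, is a made-up input for illustration.

```r
x <- "(pop+corn)-bread+salt"

# Lookahead version: + must be followed by a ) before any other bracket
lookahead_ok <- gsub("\\+(?=[^()]*\\))", "|", x, perl = TRUE)
# \G version: matching can only start at a ( or continue from the previous match
anchored_ok <- gsub("(?:\\G(?!^)|\\()[^()]*?\\K\\+", "|", x, perl = TRUE)
# Both give "(pop|corn)-bread+salt" on well-formed input

# Hypothetical malformed input with no opening parenthesis
bad <- "pop+corn)-bread+salt"
lookahead_bad <- gsub("\\+(?=[^()]*\\))", "|", bad, perl = TRUE)
anchored_bad <- gsub("(?:\\G(?!^)|\\()[^()]*?\\K\\+", "|", bad, perl = TRUE)
lookahead_bad  # "pop|corn)-bread+salt": replaced although no ( was seen
anchored_bad   # "pop+corn)-bread+salt": left unchanged
```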
library(gsubfn)
s <- "(pop(+corn)+unicorn)-bread+salt+malt"
gsubfn("\\((?:[^()]++|(?R))*\\)", ~ gsub("+", "|", m, fixed=TRUE), s, perl=TRUE, backref=0)
## => [1] "(pop(|corn)|unicorn)-bread+salt+malt"
This solves the problem of matching nested parentheses, but requires the gsubfn package. See another regex demo. See this regex description here.
Note that in case you do not have to match nested parentheses, you may use "\\([^()]*\\)" regex with the gsubfn code above. \([^()]*\) regex matches (, then any zero or more chars other than ( and ) (replace with [^)]* to also allow ( inside) and then a ).
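For the non-nested case, the same "match the group, then edit inside it" idea can be sketched in base R without gsubfn by assigning into regmatches(). This is a swapped-in base R technique rather than the answer's own code; `res_base` and `m` are illustrative names.

```r
s <- "(pop+corn)-bread+salt"

res_base <- s
# Find each non-nested "(...)" group
m <- gregexpr("\\([^()]*\\)", res_base)
# Replace + with | inside each group only, then write the groups back
regmatches(res_base, m) <- lapply(regmatches(res_base, m),
                                  function(g) gsub("+", "|", g, fixed = TRUE))
res_base
#> [1] "(pop|corn)-bread+salt"
```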
We can try
sub("(\\([^+]+)\\+","\\1|", test)
#[1] "(pop|corn)-bread+salt"

How do you capture text between the first occurrence of [ and the last occurrence of ] in R

I need to extract text between [],
I have this:
x <- "corp_applicaiton[CORP_webapp1][1]"
I need to capture this text:
CORP_webapp1][1
then replace all special characters with underscores:
I've tried this:
str_match(x, ".*\\[(.*?)].*")[,2]
but this outputs:
1
any ideas?
You can do this with regular expressions.
x<-c("corp_applicaiton[CORP_webapp1][1]")
x2 = sub(".*?\\[(.*)\\].*", "\\1", x)
gsub("\\W", "_", x2)
[1] "CORP_webapp1__1"
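The reason the attempt in the question returned 1 is which quantifier is greedy. A minimal sketch comparing the two placements is below; it uses sub() with perl = TRUE as a stand-in for the str_match call in the question, and `attempt` and `wanted` are illustrative names.

```r
x <- "corp_applicaiton[CORP_webapp1][1]"

# Greedy prefix, lazy group: the prefix runs up to the LAST [, leaving only "1"
attempt <- sub(".*\\[(.*?)\\].*", "\\1", x, perl = TRUE)
# Lazy prefix, greedy group: stops at the FIRST [ and captures up to the LAST ]
wanted <- sub(".*?\\[(.*)\\].*", "\\1", x, perl = TRUE)

attempt
#> [1] "1"
wanted
#> [1] "CORP_webapp1][1"
```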
You may achieve what you need with a single gsubfn but still it requires 2 regular expressions:
> library(gsubfn)
> x<-c("corp_applicaiton[CORP_webapp1][1]")
> gsubfn("^[^[]*\\[(.*)].*$", function(m) gsub("\\W", "_", m), x)
[1] "CORP_webapp1__1"
It will look for the following pattern:
^ - start of string
[^[]* - 0+ chars other than [
\\[ - a literal [
(.*) - Group 1 capturing any 0+ chars as many as possible up to the last...
] - literal ]
.* - and any 0+ chars up to the string end.
Then, the nested callback function with gsub("\\W", "_", m) inside will replace each non-word char (\W) with a _ in the Group 1 value, and will only return that value.
.+\[(.+)\]\[(.+)\]
replace with
$1_$2
https://regex101.com/r/3BNz8j/1

Regex - Substitute character in a matching substring

Let's say I have the following string:
input = "askl jmsp wiqp;THIS IS A MATCH; dlkasl das, fm"
I need to replace the white-spaces with underscores, but only in the substrings that match a pattern. (In this case the pattern would be a semi-colon before and after.)
The expected output should be:
output = "askl jmsp wiqp;THIS_IS_A_MATCH; dlkasl das, fm"
Any ideas how to achieve that, preferably using regular expressions, and without splitting the string?
I tried:
gsub("(.*);(.*);(.*)", "\\2", input) # Pattern matching and
gsub(" ", "_", input) # Naive gsub
Couldn't put them both together though.
Regarding the original question:
Substitute character in a matching substring
You may do it easily with gsubfn:
> library(gsubfn)
> input = "askl jmsp wiqp;THIS IS A MATCH; dlkasl das, fm"
> gsubfn(";([^;]+);", function(g1) paste0(";",gsub(" ", "-", g1, fixed=TRUE),";"), input)
[1] "askl jmsp wiqp;THIS-IS-A-MATCH; dlkasl das, fm"
The ;([^;]+); matches any string starting with ; and up to the next ; capturing the text in-between and then replacing the whitespaces with hyphens only inside the captured part.
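If gsubfn is not available, a comparable result can be sketched in base R by assigning into regmatches(). This is an alternative technique, not the answer's code, and it uses underscores as the question originally asked; `res_us` and `m` are illustrative names.

```r
input <- "askl jmsp wiqp;THIS IS A MATCH; dlkasl das, fm"

res_us <- input
# Find each ;...; segment
m <- gregexpr(";[^;]+;", res_us)
# Replace spaces only inside those segments, then write them back
regmatches(res_us, m) <- lapply(regmatches(res_us, m),
                                function(g) gsub(" ", "_", g, fixed = TRUE))
res_us
#> [1] "askl jmsp wiqp;THIS_IS_A_MATCH; dlkasl das, fm"
```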
Another approach is to use a PCRE regex with a \G based regex with gsub:
p = "(?:\\G(?!\\A)|;)(?=[^;]*;)[^;\\s]*\\K\\s"
> gsub(p, "-", input, perl=TRUE)
[1] "askl jmsp wiqp;THIS-IS-A-MATCH; dlkasl das, fm"
See the online regex demo
Pattern details:
(?:\\G(?!\\A)|;) - a custom boundary: either the end of the previous successful match (\\G(?!\\A)) or (|) a semicolon
(?=[^;]*;) - a lookahead check: there must be a ; after 0+ chars other than ;
[^;\\s]* - 0+ chars other than ; and whitespaces
\\K - omitting the text matched so far
\\s - 1 single whitespace character (if multiple whitespaces are to be replaced with 1 hyphen, add + after it).
