Match only parentheses containing text and numbers in R

I would like to replace the parentheses and the text between them in string variables. However, I only want to replace those parentheses that contain at least one number.
Example string:
text <- c("Sekretär (dipl.) (G3)", "Zolldeklarant (3 Jahre)", "Grenzwächter (< 2 Jahre)")
I tried the following:
str_extract_all(text, " *\\(.*?\\d+.*?\\) *")
It does extract the text in parentheses, but for the first string it also matches the first parenthesized group, which contains no number.
The extraction should look like:
" (G3)"
" (3 Jahre)"
" (< 2 Jahre)"

If you want to replace these terms in parentheses, containing at least one number, then gsub is a good base R option. It is vectorized, so no sapply wrapper is needed:
text
[1] "Sekretär (dipl.) (G3)" "Zolldeklarant (3 Jahre)" "Grenzwächter (< 2 Jahre)"
gsub("\\([^()]*\\d[^()]*\\)", "REMOVED", text)
[1] "Sekretär (dipl.) REMOVED" "Zolldeklarant REMOVED" "Grenzwächter REMOVED"
I have replaced with the literal text REMOVED just as a placeholder to show the replacement.
Edit:
If you just want to extract these terms, we can also use gsub for this:
gsub(".*(\\([^()]*\\d[^()]*\\)).*", "\\1", text)
[1] "(G3)" "(3 Jahre)" "(< 2 Jahre)"
Here, we capture the term in parentheses, then replace the entire string with just the first (and only) capture group \\1.

You can use
\([^()]*\d+[^()]*\)
See a demo on regex101.com.
Backslashes need to be doubled inside R string literals, so your expression would become
\\([^()]*\\d+[^()]*\\)
Broken down this is
\( # (
[^()]* # not ( nor ), 0+ times
\d+ # digits, 1+
[^()]* # same as above
\) # )
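To tie this back to the str_extract_all attempt in the question, here is a minimal base R sketch that applies the same pattern with gregexpr and regmatches. The leading ` *` (kept from the question so that the space before each group is included) and the variable names `pattern` and `hits` are illustrative assumptions, not part of the answer above.

```r
text <- c("Sekretär (dipl.) (G3)", "Zolldeklarant (3 Jahre)", "Grenzwächter (< 2 Jahre)")

# [^()]* cannot cross a parenthesis, so "(dipl.)" can never be glued
# onto the following "(G3)" the way the lazy .*? in the question allowed.
pattern <- " *\\([^()]*\\d+[^()]*\\)"

# gregexpr() finds every match per string; regmatches() pulls them out.
hits <- unlist(regmatches(text, gregexpr(pattern, text, perl = TRUE)))
print(hits)
```

Each element of `hits` keeps its leading space, which is what the expected output in the question shows.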

text <- c("Sekretär (dipl.) (G3)", "Zolldeklarant (3 Jahre)", "Grenzwächter (< 2 Jahre)")
gsub(".*\\((.*[0-9].*)\\).*","(\\1)",text)
Basically, you ask gsub to match the whole string while capturing as group \1 the contents of a pair of parentheses that includes a number; the replacement then keeps only that group, wrapped in parentheses again.
Note that if you always want to extract the last pair of parentheses, a different approach may be needed.
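One caveat with the `.*`-based pattern may be worth illustrating: because `.` can also match parentheses, a later group without a number can get swallowed into the capture. The sketch below uses a made-up input `x` (not from the question) to compare it with a `[^()]`-based pattern; `greedy` and `strict` are illustrative names.

```r
# Hypothetical input: a numbered group followed by a group without a number.
x <- "Beruf (3 Jahre) (dipl.)"

# Greedy .* inside the capture runs across ") (" up to the last ")".
greedy <- gsub(".*\\((.*[0-9].*)\\).*", "(\\1)", x, perl = TRUE)

# [^()]* stops at any parenthesis, so only the numbered group is matched.
strict <- regmatches(x, regexpr("\\([^()]*[0-9][^()]*\\)", x, perl = TRUE))

print(greedy)  # "(3 Jahre) (dipl.)"
print(strict)  # "(3 Jahre)"
```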

Related

R: add a character to a specific spot in string, trouble with regex syntax

I have a list of strings like so:
batch1, batch2, batch3, batch10, batch11
I am trying to add a 0 before the single digits to get: batch01, batch02, batch03, batch10, batch11
I have found many similar questions and tried to write my own regex. I am very close, but I can't quite make it do what I want.
Batch <- gsub('(.{5})([0-9]{1}\\b)','\\10\\2', Batch)
outputs batch01, batch02, batch03, batch100, batch110
Using \\s instead of \\b doesn't change any values.
sampleNames$Batch <- gsub('(.{5})([0-9]{1})','\\10\\2', sampleNames$Batch)
outputs batch01, batch02, batch03, batch010, batch011
I've played around with a few other versions but I cannot seem to get it correct. I know this is a somewhat repetitive question, but I have not been able to alter previous solutions to do what I need to do.
We can capture the last digit and the lower case letter before it as two groups, then in the replacement specify the backreference of the groups and the 0 in between. Thus, it won't match the ones having two digits at the end of the string
sub("([a-z])(\\d)$", "\\10\\2", Batch)
[1] "batch01" "batch02" "batch03" "batch10" "batch11"
Or we may use sprintf/str_pad with str_replace
library(stringr)
str_replace(Batch, "\\d+$", function(x) sprintf("%02d", as.numeric(x)))
[1] "batch01" "batch02" "batch03" "batch10" "batch11"
data
Batch <- c("batch1", "batch2", "batch3", "batch10", "batch11")
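An alternative that avoids the `\\10\\2` backreference juggling is to split each name into its prefix and its trailing number and let sprintf do the padding. This is a sketch under the assumption that every element ends in at least one digit; `prefix`, `number`, and `padded` are illustrative names.

```r
Batch <- c("batch1", "batch2", "batch3", "batch10", "batch11")

# Everything before the trailing digits.
prefix <- sub("\\d+$", "", Batch)

# The trailing digits themselves, converted to integers.
number <- as.integer(sub("^.*?(\\d+)$", "\\1", Batch, perl = TRUE))

# %02d pads the number to width 2 with leading zeros.
padded <- sprintf("%s%02d", prefix, number)
print(padded)
```

Changing `%02d` to `%03d` would pad to three digits without touching the regex.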
Use
sampleNames$Batch <- sub("(\\D|^)(\\d)$", "\\10\\2", sampleNames$Batch, perl=TRUE)
See regex proof.
EXPLANATION
( - group and capture to \1:
\D - a non-digit (any char but 0-9)
| - OR
^ - the beginning of the string
) - end of \1
( - group and capture to \2:
\d - a digit (0-9)
) - end of \2
$ - the end of the string (or the position before an optional trailing \n)
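The `|^` alternative in group 1 only matters for strings that consist of a single digit and nothing else. A small sketch with made-up inputs (the vector `ids` is hypothetical) covers the three cases:

```r
# Hypothetical inputs: bare digit, prefix + one digit, prefix + two digits.
ids <- c("7", "batch7", "batch17")

# Group 1 is either a non-digit or the (empty) start of the string;
# "\\10\\2" is read as group 1, a literal 0, then group 2.
out <- sub("(\\D|^)(\\d)$", "\\10\\2", ids, perl = TRUE)
print(out)  # "07" "batch07" "batch17"
```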
You can also use the following solution:
sapply(Batch, function(x) {
d <- gsub("([[:alpha:]]+)(\\d)", "\\2", x)
if(nchar(d) == 1) {
gsub("([[:alpha:]]+)(\\d)", "\\10\\2", x)
} else {
x
}
})
batch1 batch2 batch3 batch10 batch11
"batch01" "batch02" "batch03" "batch10" "batch11"

How to match phonemic transcriptions with a single vowel except if a condition applies

I have phonemic transcriptions of English words such as these:
test <- c("ˈsɜːtnli", "ˈtwɛnti", "ˈfɒksi", "kɑːnt", "ʧeɪnʤd", "vɪkˈtɔːrɪə", "wɒznt", "ðeər", "dɪdnt",
"ˈdɪzni", "ˈəʊnli", "ˈfæbrɪks", "sɪˈkjʊərɪti", "ˈnjuːzˌpeɪpər", "ɑhɑː")
I'd like to match mono-syllabic words, i.e., words that contain a single vowel. My set of phonemic vowels is this:
vowel <- "iː|aɪ|ɔː|ɔɪ|əʊ|ɛə|eɪ|aʊ|eə|uː|ɑː|ɪə|ɜː|ʊə|ə|ɪ|ɒ|ʊ|ʌ|æ|e|ɑ|ɛ|i"
Using str_count and the vector vowel as pattern, I'm able to match a fairly good set of words:
library(stringr)
test[str_count(test, vowel) == 1]
[1] "kɑːnt" "ʧeɪnʤd" "wɒznt" "ðeər" "dɪdnt"
However, wɒznt and dɪdnt can be seen as bi-syllabic (as the n sound can replace a vowel, so that nt counts as a second vowel). So the question is, how can I match mono-syllabic words except those that end in nt?
What I've tried so far is this set operation, which works well but looks clumsy:
setdiff(test[str_count(test, vowel) == 1], test[str_count(test, paste0("[^", vowel, "]nt$")) == 1])
[1] "kɑːnt" "ʧeɪnʤd" "ðeər"
I'd much rather have a single more concise regex. Any ideas?
You can use
test <- c("ˈsɜːtnli", "ˈtwɛnti", "ˈfɒksi", "kɑːnt", "ʧeɪnʤd", "vɪkˈtɔːrɪə", "wɒznt", "ðeər", "dɪdnt",
"ˈdɪzni", "ˈəʊnli", "ˈfæbrɪks", "sɪˈkjʊərɪti", "ˈnjuːzˌpeɪpər", "ɑhɑː")
vowel <- "iː|aɪ|ɔː|ɔɪ|əʊ|ɛə|eɪ|aʊ|eə|uː|ɑː|ɪə|ɜː|ʊə|ə|ɪ|ɒ|ʊ|ʌ|æ|e|ɑ|ɛ|i"
library(stringr)
p <- paste0("^(?!.*(?<!",vowel,")nt$)(?:(?!",vowel,").)*(?:",vowel,")(?:(?!",vowel,").)*$")
test[str_detect(test, p)]
## => [1] "kɑːnt" "ʧeɪnʤd" "ðeər"
See the online R demo. See the regex demo. The pattern means
^ - start of string
(?!.*(?<!",vowel,")nt$) - immediately to the right, there must not be any 0+ chars other than line break chars as many as possible followed with nt (not preceded with any of the specified vowel sound sequences) and end of string
(?:(?!",vowel,").)* - any char but a line break char, zero or more times as many as possible, that does not start a vowel char sequence
(?:",vowel,") - any of the specified vowel sound sequences
(?:(?!",vowel,").)* - any char but a line break char, zero or more times as many as possible, that does not start a vowel char sequence
$ - end of string.
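The building block `(?:(?!V).)*` (consume characters one at a time as long as no vowel sequence starts at that position) is easier to see on a toy alphabet. The sketch below is a simplified ASCII illustration with an invented vowel set and word list; it leaves out the `nt$` exclusion and is not a substitute for the phonemic pattern above.

```r
# Toy vowel inventory (an assumption for illustration); longer sequences first.
vowel <- "ee|oo|a|e|i|o|u"
words <- c("tree", "banana", "cat", "moon", "sky")

# Non-vowel run, exactly one vowel sequence, non-vowel run.
p <- paste0("^(?:(?!", vowel, ").)*(?:", vowel, ")(?:(?!", vowel, ").)*$")

one_vowel <- words[grepl(p, words, perl = TRUE)]
print(one_vowel)  # "tree" "cat" "moon"
```

"banana" is rejected because a second vowel appears after the first one, and "sky" because it contains none of the listed sequences.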
This is a somewhat concise solution (thanks to @G5W for the decisive hint):
vowel_cc <- paste0(unique(unlist(strsplit(gsub("\\|", "", vowel), ""))), collapse = "")
vowel_cc
[1] "iːaɪɔəʊɛeuɑɜɒʌæ"
test[str_count(test, paste0(vowel, "|[^", vowel_cc, "]+nt$")) == 1]
[1] "kɑːnt" "ʧeɪnʤd" "ðeər"
This solution uses a string vowel_cc consisting of all unique characters in vowel. These serve as input for a negated character class. The pattern specifies nt as one of the vowel alternatives on the condition that it is preceded by one or more characters not in vowel_cc and occurs at the end of the string.

Positive Lookbehind and Lookahead to the end of string

My string pattern looks like this:
UNB+UNOC:3+4399945681577+_GLN_Company__+180101:0050+10870 and I am trying to extract everything after the second last +, i.e. 180101:0050+10870.
Thus far, I managed to address the second last block 180101:0050 with this expression (?<=\+)[^\+]+(?=\+[^\+]*$) but fail to include the last block including the last +. Here is my sample: regex101
The expression is meant for R and I still need to escape the characters later on. This format is just for testing purposes in regex101.
We could use a capture group based on the occurrence of + counted from the end ($) of the string.
sub(".*\\+([^+]+\\+[^+]+$)", "\\1", str1)
#[1] "180101:0050+10870"
data
str1 <- "UNB+UNOC:3+4399945681577+_GLN_Company__+180101:0050+10870"
You may use
\+\K[^+]+\+[^+]*$
Or, if you would like to use it with stringr::str_extract:
(?<=\+)[^+]+\+[^+]*$
See the regex demo. Details:
\+ - a + char
\K - match reset operator
(?<=\+) - location right after a + symbol
[^+]+ - one or more chars other than +
\+ - a +
[^+]* - zero or more chars other than +
$ - end of string.
See R demo online:
x <- "UNB+UNOC:3+4399945681577+_GLN_Company__+180101:0050+10870"
regmatches(x, regexpr("\\+\\K[^+]+\\+[^+]*$", x, perl=TRUE))
## => [1] "180101:0050+10870"
library(stringr)
str_extract(x, "(?<=\\+)[^+]+\\+[^+]*$")
## => [1] "180101:0050+10870"
Another way to do it in this case:
library(stringr)
str_extract("UNB+UNOC:3+4399945681577+_GLN_Company__+180101:0050+10870", "\\d+:\\d+\\+\\d+")
#"180101:0050+10870"
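If the fields themselves never contain a literal +, the same result can be sketched without lookarounds by splitting on + and re-joining the last two pieces. The helper name `last_two` is made up for this example.

```r
str1 <- "UNB+UNOC:3+4399945681577+_GLN_Company__+180101:0050+10870"

# Hypothetical helper: keep the last two "+"-separated fields.
last_two <- function(s) {
  parts <- strsplit(s, "+", fixed = TRUE)[[1]]   # fixed = TRUE: "+" is literal
  paste(utils::tail(parts, 2), collapse = "+")
}

result <- last_two(str1)
print(result)  # "180101:0050+10870"
```

Unlike the `\\d+:\\d+\\+\\d+` pattern, this does not depend on the last two fields being numeric.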

Replace a specific character only between parentheses

Let's say I have a string:
test <- "(pop+corn)-bread+salt"
I want to replace only the plus sign that is between parentheses with '|', so I get:
"(pop|corn)-bread+salt"
I tried:
gsub("([+])","\\|",test)
But it replaces all the plus signs of the string (obviously)
If you want to replace all + symbols that are inside parentheses (if there may be 1 or more), you can use any of the following solutions:
gsub("\\+(?=[^()]*\\))", "|", x, perl=TRUE)
See the regex demo. Here, the + is only matched when it is followed with any 0+ chars other than ( and ) (with [^()]*) and then a ). It is only good if the input is well-formed and there are no nested parentheses, as it does not check whether there was a starting (.
gsub("(?:\\G(?!^)|\\()[^()]*?\\K\\+", "|", x, perl=TRUE)
This is a safer solution since it starts matching + only if there was a starting (. See the regex demo. In this pattern, (?:\G(?!^)|\() matches the end of the previous match (\G(?!^)) or (|) a (, then [^()]*? matches any 0+ chars other than ( and ) chars, and then \K discards all the matched text and \+ matches a + that will be consumed and replaced. It still does not handle nested parentheses.
Also, see an online R demo for the above two solutions.
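The difference between the two patterns shows up on input that is not well-formed. In this sketch the string `bad` (an invented example with a closing parenthesis but no opening one) is changed by the lookahead version but left alone by the \G version; `good`, `lookahead`, and `anchored` are likewise illustrative names.

```r
bad  <- "pop+corn)-bread+salt"        # hypothetical malformed input, no "("
good <- "(pop+corn+nuts)-bread+salt"  # well-formed, two "+" inside the parentheses

lookahead <- "\\+(?=[^()]*\\))"
anchored  <- "(?:\\G(?!^)|\\()[^()]*?\\K\\+"

r1 <- gsub(lookahead, "|", bad,  perl = TRUE)
r2 <- gsub(anchored,  "|", bad,  perl = TRUE)
r3 <- gsub(anchored,  "|", good, perl = TRUE)

print(r1)  # "pop|corn)-bread+salt"
print(r2)  # "pop+corn)-bread+salt"
print(r3)  # "(pop|corn|nuts)-bread+salt"
```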
library(gsubfn)
s <- "(pop(+corn)+unicorn)-bread+salt+malt"
gsubfn("\\((?:[^()]++|(?R))*\\)", ~ gsub("+", "|", m, fixed=TRUE), s, perl=TRUE, backref=0)
## => [1] "(pop(|corn)|unicorn)-bread+salt+malt"
This solves the problem of matching nested parentheses, but requires the gsubfn package. See another regex demo. See this regex description here.
Note that in case you do not have to match nested parentheses, you may use the "\\([^()]*\\)" regex with the gsubfn code above. The \([^()]*\) regex matches (, then any zero or more chars other than ( and ) (replace with [^)]* to also allow ( chars), and then a ).
We can try
sub("(\\([^+]+)\\+","\\1|", test)
#[1] "(pop|corn)-bread+salt"

Regex - Substitute character in a matching substring

Let's say I have the following string:
input = "askl jmsp wiqp;THIS IS A MATCH; dlkasl das, fm"
I need to replace the white-spaces with underscores, but only in the substrings that match a pattern. (In this case the pattern would be a semi-colon before and after.)
The expected output should be:
output = "askl jmsp wiqp;THIS_IS_A_MATCH; dlkasl das, fm"
Any ideas how to achieve that, preferably using regular expressions, and without splitting the string?
I tried:
gsub("(.*);(.*);(.*)", "\\2", input) # Pattern matching and
gsub(" ", "_", input) # Naive gsub
Couldn't put them both together though.
Regarding the original question:
Substitute character in a matching substring
You may do it easily with gsubfn:
library(gsubfn)
input = "askl jmsp wiqp;THIS IS A MATCH; dlkasl das, fm"
gsubfn(";([^;]+);", function(g1) paste0(";",gsub(" ", "-", g1, fixed=TRUE),";"), input)
[1] "askl jmsp wiqp;THIS-IS-A-MATCH; dlkasl das, fm"
The ;([^;]+); matches any string starting with ; and up to the next ; capturing the text in-between and then replacing the whitespaces with hyphens only inside the captured part.
Another approach is to use a PCRE regex with a \G based regex with gsub:
p = "(?:\\G(?!\\A)|;)(?=[^;]*;)[^;\\s]*\\K\\s"
gsub(p, "-", input, perl=TRUE)
[1] "askl jmsp wiqp;THIS-IS-A-MATCH; dlkasl das, fm"
See the online regex demo
Pattern details:
(?:\\G(?!\\A)|;) - a custom boundary: either the end of the previous successful match (\\G(?!\\A)) or (|) a semicolon
(?=[^;]*;) - a lookahead check: there must be a ; after 0+ chars other than ;
[^;\\s]* - 0+ chars other than ; and whitespaces
\\K - omitting the text matched so far
\\s - 1 single whitespace character (if multiple whitespaces are to be replaced with 1 hyphen, add + after it).
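For comparison, base R can do the same match-then-modify step without gsubfn by assigning into regmatches(). This sketch uses underscores, as the question asked for; `output` and `m` are illustrative names.

```r
input <- "askl jmsp wiqp;THIS IS A MATCH; dlkasl das, fm"

output <- input
m <- gregexpr(";[^;]+;", output)   # locate each ;...; block

# Rewrite only the matched blocks, then write them back in place.
regmatches(output, m) <- lapply(
  regmatches(output, m),
  function(block) gsub(" ", "_", block, fixed = TRUE)
)
print(output)
```

Text outside the semicolon-delimited block keeps its spaces because it is never part of a match.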

Resources