Find closing parenthesis with regex in R

I have several strings with unmatched opening or closing parentheses. I managed to remove an opening parenthesis when there is no closing one, but I cannot remove a closing parenthesis when there is no opening one. I want to leave strings with matching parentheses alone:
string1 = "This (is solved"
string2 = "This is (fine)"
string3 = "This is the problem)"
This is how I was able to handle the first problem case (an opening parenthesis but no closing one):
str_remove(data, "[(](?!.*[)])")
But I cannot seem to turn it around. The following grabs all closing parentheses, not just the one without an opening one.
"(?!.*[(])[)]"
Any ideas are appreciated!

If you do not need to handle nested paired (balanced) parentheses, you can use
gsub("(\\([^()]*\\))|[()]", "\\1", string)
See the regex demo. Details:
(\([^()]*\)) - Group 1 (\1 refers to this group value): (, then zero or more chars other than ( and ), and then a ) char
| - or
[()] - a ( or ) char.
See the R demo:
x <- c("This (is solved", "This is (fine)", "This is the problem)")
gsub("(\\([^()]*\\))|[()]", "\\1", x)
# => [1] "This is solved" "This is (fine)" "This is the problem"
If the parentheses can be nested, you can use
gsub("(\\((?:[^()]++|(?1))*\\))|[()]", "\\1", string, perl=TRUE)
See this regex demo. Details:
(\((?:[^()]++|(?1))*\)) - Group 1:
\( - a ( char
(?:[^()]++|(?1))* - zero or more sequences of either one or more chars other than ( and ), or the whole Group 1 pattern that is recursed
\) - a ) char
|[()] - or a ( / ) char.
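The recursive pattern above comes without an R demo, so here is a small sketch of how it behaves. The sample strings and the names nested and cleaned are made up for illustration; perl=TRUE is required because (?1) and ++ are PCRE syntax.

```r
# Hypothetical sample strings: one balanced nested group surrounded by stray
# parentheses, and one string with a single extra opening parenthesis.
nested <- c("f(a(b)c) and )x(", "((unbalanced)")

# Balanced (possibly nested) groups are captured in Group 1 and put back via \\1;
# any other parenthesis is matched by [()] and replaced with the empty Group 1.
cleaned <- gsub("(\\((?:[^()]++|(?1))*\\))|[()]", "\\1", nested, perl = TRUE)
print(cleaned)
```

In the second string only the outermost, unmatched ( is dropped, because the engine first fails to find a balanced group starting there and falls back to the [()] alternative.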

Related

grep in R, literal and pattern match

I have seen in manuals how to use grep to match either a pattern or an exact string. However, I cannot figure out how to do both at the same time. I have a latex file where I want to find the following pattern:
\caption[SOME WORDS]
and replace it with:
\caption[\textit{SOME WORDS}]
I have tried with:
texfile <- sub('\\caption[','\\caption[\\textit ', texfile, fixed=TRUE)
but I do not know how to tell grep that there should be some text after the square bracket, and then a closed square bracket.
You can use
texfile <- "\\caption[SOME WORDS]" ## -> \caption[\textit{SOME WORDS}]
texfile <- gsub('(\\\\caption\\[)([^][]*)]', '\\1\\\\textit{\\2}]', texfile)
cat(texfile)
## -> \caption[\textit{SOME WORDS}]
See the R demo online.
Details:
(\\caption\[) - Group 1 (\1 in the replacement pattern): a \caption[ string
([^][]*) - Group 2 (\2 in the replacement pattern): any zero or more chars other than [ and ]
] - a ] char.
Another solution based on a PCRE regex:
gsub('\\Q\\caption[\\E\\K([^][]*)]','\\\\textit{\\1}]', texfile, perl=TRUE)
See this R demo online. Details:
\Q - start "quoting", i.e. treating the patterns to the right as literal text
\caption[ - a literal fixed string
\E - stop quoting the pattern
\K - omit text matched so far
([^][]*) - Group 1 (\1): any zero or more non-bracket chars
] - a ] char.
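As a quick check that both patterns behave the same on more than one line, here is a minimal sketch. The vector tex and its contents are invented for this example; only the two gsub calls come from the answer above.

```r
# Hypothetical lines of a LaTeX file, including a line with two captions
tex <- c("\\caption[First title]{Long caption}",
         "plain line",
         "\\caption[A]{x} \\caption[B]{y}")

# Default (TRE) engine: capture the prefix and the bracket content separately
tre_out <- gsub('(\\\\caption\\[)([^][]*)]', '\\1\\\\textit{\\2}]', tex)

# PCRE engine: quote the literal prefix with \Q...\E and drop it from the match with \K
pcre_out <- gsub('\\Q\\caption[\\E\\K([^][]*)]', '\\\\textit{\\1}]', tex, perl = TRUE)

# cat() shows the strings as they would appear in the file
cat(tre_out, sep = "\n")
print(identical(tre_out, pcre_out))
```

Lines without a \caption[...] part are returned unchanged, so the call can be applied to the whole file at once.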

Extract exact matches from array

Assume I have text and I want to extract exact matches. How can I do this efficiently:
test_text <- c("[]", "[1234]", "[1234a]", "[v1256a] ghjk kjh",
"[othername1256b] kjhgfd hgj",
"[v1256] ghjk kjh", "[v1256] kjhgfd hgj",
" text here [name1991] and here",
"[name1990] this is an explanation",
"[name1991] this is another explanation",
"[mäölk1234]")
expected <- c("[v1256a]", "[othername1256b]", "[v1256]", "[v1256]", "[name1991]",
"[name1990]", "[name1991]", "[mäölk1234]")
# This works:
regmatches(test_text, regexpr("\\[.*[0-9]{4}.*\\]", test_text))
But I guess something like "\\[.*[0-9]{4}(?[a-z])]\\]" would be better, but it throws an error:
Error in regexpr("\[.[0-9]{4}(?[a-z])]\]", text) : invalid
regular expression '[.[0-9]{4}(?[a-z])]]', reason 'Invalid regexp'
Only ONE letter should follow the year, but there can be none, see the example. Sorry, I rarely use regexpr...
Updated question solution
It seems you want to extract all occurrences of 1+ letters followed with 4 digits and then an optional letter inside square brackets.
Use
test_text <- c("[]", "[1234]", "[1234a]", "[v1256a] ghjk kjh",
"[othername1256b] kjhgfd hgj",
"[v1256] ghjk kjh", "[v1256] kjhgfd hgj",
" text here [name1991] and here",
"[name1990] this is an explanation",
"[name1991] this is another explanation",
"[mäölk1234]")
regmatches(test_text, regexpr("\\[\\p{L}+[0-9]{4}\\p{L}?]", test_text, perl=TRUE))
# => c("[v1256a]", "[othername1256b]", "[v1256]", "[v1256]", "[name1991]",
# "[name1990]", "[name1991]", "[mäölk1234]")
See the R demo online. NOTE that you need to use a PCRE regex for this to work, perl=TRUE is crucial here.
Details
\[ - a [ char
\p{L}+ - one or more Unicode letters
[0-9]{4} - four ASCII digits
\p{L}? - an optional Unicode letter
] - a ] char.
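Note that regexpr returns only the first match in each string. If a string may contain several bracketed keys, gregexpr can be swapped in, as in this sketch. The sample vector refs is hypothetical and kept to ASCII for simplicity.

```r
# Hypothetical input: the first string holds two keys, the others hold none
refs <- c("see [name1991] and [v1256a]", "[1234] only digits", "none")

# gregexpr finds every match per string; regmatches then returns a list
# with one character vector per input element
all_keys <- regmatches(refs, gregexpr("\\[\\p{L}+[0-9]{4}\\p{L}?]", refs, perl = TRUE))
print(all_keys)

# Flatten when a single vector of keys is wanted
flat_keys <- unlist(all_keys)
print(flat_keys)
```

Elements without a match come back as character(0), so the list keeps the same length as the input.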
Original answer
Use
regmatches(test_text, regexpr("\\[[^][]*[0-9]{4}[[:alpha:]]?]", test_text))
Or
regmatches(test_text, regexpr("\\[[^][]*[0-9]{4}[a-zA-Z]?]", test_text))
See the regex demo.
Details
\[ - a [ char
[^][]* - 0 or more chars other than [ and ] (HINT: if you only expect letters here replace with [[:alpha:]]* or [a-zA-Z]*)
[0-9]{4} - four digits
[[:alpha:]]? - an optional letter (or [a-zA-Z]? will match any ASCII optional letter)
] - a ] char
R test:
regmatches(test_text, regexpr("\\[[^][]*[0-9]{4}[[:alpha:]]?]", test_text))
## => [1] "[v1256a]" "[othername1256b]" "[v1256]" "[v1256]" "[name1991]" "[name1990]" "[name1991]"

Finding a word with condition in a vector with regex on R (perl)

I would like to find the rows in a vector with the word 'RT' in it or 'R' but not if the word 'RT' is preceded by 'no'.
The word RT may be preceded by nothing, a space, a dot, etc.
With the regex, I tried :
grep("(?<=[no] )RT", aaa,ignore.case = FALSE, perl = T)
Which was giving me all the rows with "no RT".
and
grep("(?=[^no].*)RT",aaa , perl = T)
which was giving me all the rows containing 'RT' with and without 'no' at the beginning.
What is my mistake? I thought the ^ was giving everything but the character that follows it.
Example :
aaa = c("RT alone", "no RT", "CT/RT", "adj.RTx", "RT/CT", "lang, RT+","npo RT" )
(?<=[no] )RT matches any RT that is immediately preceded with "n " or "o ".
You should use a negative lookbehind,
"(?<!no )RT"
See the regex demo.
Or, if you need to check for a whole word no,
"(?<!\\bno )RT"
See this regex demo.
Here, (?<!no ) makes sure there is no "no " (the word no followed by a space) immediately to the left of the current location, and only then is RT consumed.
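Applied to the vector from the question, the two lookbehinds can be compared as follows. This is a sketch; the string "piano RT" is a made-up case added only to show where the word boundary matters.

```r
aaa <- c("RT alone", "no RT", "CT/RT", "adj.RTx", "RT/CT", "lang, RT+", "npo RT")

# Indices of elements containing RT that is not directly preceded by "no "
hits <- grep("(?<!no )RT", aaa, perl = TRUE)
print(aaa[hits])

# Hypothetical extra cases: "piano RT" ends in "no " but is not the word "no"
extra <- c("no RT", "piano RT")
plain <- grepl("(?<!no )RT", extra, perl = TRUE)      # rejects both
whole <- grepl("(?<!\\bno )RT", extra, perl = TRUE)   # rejects only the whole word "no"
print(plain)
print(whole)
```

"npo RT" is kept by both patterns because the three characters before RT are "po ", not "no ".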

R regex match things other than known characters

For a text field, I would like to expose those that contain invalid characters. The list of invalid characters is unknown; I only know the list of accepted ones.
For example for French language, the accepted list is
A-z, 1-9, [punc::], space, àéèçè, hyphen, etc.
The list of invalid characters is unknown, yet I want anything unusual to surface. For example, I would want
This is an 2-piece à-la-carte dessert to pass, while
'Ã this Øs an apple' pops up as an anomaly.
The 'not contain' notion in R does not behave as I would like. For example,
grep("[^(abc)]",c("abcdef", "defabc", "apple") )
(those that do not contain 'abc') matches all three, while
grep("(abc)",c("abcdef", "defabc", "apple") )
behaves correctly and matches only the first two. Am I missing something?
How can we do that in R? Also, how can we include the hyphen in the list of accepted characters?
[a-z1-9[:punct:] àâæçéèêëîïôœùûüÿ-]+
The above regex matches any of the following (one or more times). Note that the parameter ignore.case=T used in the code below allows the following to also match uppercase variants of the letters.
a-z Any lowercase ASCII letter
1-9 Any digit in the range from 1 to 9 (excludes 0)
[:punct:] Any punctuation character
The space character
àâæçéèêëîïôœùûüÿ Any valid French character with a diacritic mark
- The hyphen character
See code in use here
x <- c("This is an 2-piece à-la-carte dessert", "Ã this Øs an apple")
gsub("[a-z1-9[:punct:] àâæçéèêëîïôœùûüÿ-]+", "", x, ignore.case=T)
The code above replaces all valid characters with nothing. The result is all invalid characters that exist in the string. The following is the output:
[1] "" "ÃØ"
If by "expose the invalid characters" you mean delete the "accepted" ones, then a regex character class should be helpful. From the ?regex help page we can see that a hyphen is already part of the punctuation character vector;
[:punct:]
Punctuation characters:
! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~
So the code could be:
x <- 'Ã this Øs an apple'
gsub("[A-z1-9[:punct:] àéèçè]+", "", x)
#[1] "ÃØ"
Note that regex has a predefined, locale-specific "[:alpha:]" named character class that would probably be both safer and more compact than the expression "[A-zàéèçè]" especially since the post from ctwheels suggests that you missed a few. The ?regex page indicates that "[0-9A-Za-z]" might be both locale- and encoding-specific.
If by "expose" you instead meant "identify the position within the string", then you could use the negation operator "^" within the character class formalism and apply gregexpr:
gregexpr("[^A-z1-9[:punct:] àéèçè]+", x)
[[1]]
[1] 1 8
attr(,"match.length")
[1] 1 1
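If the goal is simply to flag which strings contain at least one character outside the accepted set, the same negated class can be passed to grepl. The sketch below makes a few assumptions that are not in the answers above: it uses A-Za-z0-9 instead of A-z1-9, lists only a handful of accented letters, and writes non-ASCII characters as \u escapes so that the snippet does not depend on the encoding of the source file.

```r
# Assumed accepted set: ASCII letters and digits, punctuation (which already
# includes the hyphen), space, and a few French accented letters:
# \u00e0 = a grave, \u00e2 = a circumflex, \u00e7 = c cedilla,
# \u00e9 = e acute, \u00e8 = e grave, \u00ea = e circumflex
invalid <- "[^A-Za-z0-9[:punct:] \u00e0\u00e2\u00e7\u00e9\u00e8\u00ea]"

# The two strings from the question; \u00c3 is A tilde, \u00d8 is O slash
x <- c("This is an 2-piece \u00e0-la-carte dessert",
       "\u00c3 this \u00d8s an apple")

# TRUE marks a string with at least one character outside the accepted set
has_invalid <- grepl(invalid, x)
print(has_invalid)
```

Extend the list of accented letters as needed; any letter left out of the class will be reported as invalid.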

remove all characters between string and bracket in R

Say I have a dataframe df in which a column df$strings contains strings like
[cat 00.04;09]
[cat 00.04;10]
and so on. I want to remove all characters between "[cat" and "]" to yield
[cat]
[cat]
I've tried this using gsub but it's not working and I'm not sure what I'm doing wrong:
gsub('cat*?\\]', '', df)
Note that the cat*?\\] pattern matches ca, then any 0+ t chars, as few as possible, and then ].
You want to match any chars other than ] between [cat and ]:
gsub('\\[cat[^]]*\\]', '[cat]', df$strings)
Here,
\\[ - matches [
cat - matches cat
[^]]* - 0+ chars other than ]. Note that a ] placed at the start of a bracket expression must not be escaped with the default TRE engine; if you do escape it, add the perl=TRUE argument, since the PCRE engine handles regex escapes inside bracket expressions.
\\] - a ] (you do not even need to escape it, you may just use ]).
See the R demo:
x <- c("[cat 00.04;09]", "[cat 00.04;10]")
gsub('\\[cat[^]]*\\]', '[cat]', x)
## => [1] "[cat]" "[cat]"
If cat can be any word, use
gsub('\\[(\\w+)[^]]*\\]', '[\\1]', x)
where (\\w+) is a capturing group with ID=1 that matches 1 or more word chars, and \\1 in the replacement pattern is a replacement backreference that stands for the group value.
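A short sketch of the generic version on mixed input. The sample strings other than the first are invented for illustration.

```r
# Hypothetical input with different leading words and some trailing text
x <- c("[cat 00.04;09]", "[dog 12.30;01] barked", "no brackets")

# Keep the first word after [ (Group 1) and drop everything up to the closing ]
trimmed <- gsub("\\[(\\w+)[^]]*\\]", "[\\1]", x)
print(trimmed)
# => [1] "[cat]"        "[dog] barked" "no brackets"
```

Strings without a bracketed part pass through unchanged.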
