Splitting a comma- and semicolon-delimited string in R - r

I'm trying to split a string containing two entries and each entry has a specific format:
Category (e.g. active site/region) which is followed by a :
Term (e.g. His, Glu/nucleotide-binding motif A) which is followed by a ,
Here's the string that I want to split:
string <- "active site: His, Glu,region: nucleotide-binding motif A,"
This is what I have tried so far. Except for the two empty substrings, it produces the desired output.
unlist(str_extract_all(string, ".*?(?=,(?:\\w+|$))"))
[1] "active site: His, Glu" "" "region: nucleotide-binding motif A"
[4] ""
How do I get rid of the empty substrings?

You get the empty strings because .*? can also match an empty string where this assertion (?=,(?:\\w+|$)) is true
You can exclude matching a colon or comma using a negated character class before matching :
[^:,\n]+:.*?(?=,(?:\w|$))
Explanation
[^:,\n]+ Match 1+ chars other than : , or a newline
: Match the colon
.*? Match any char as least as possbiel
(?= Positive lookahead, assert that what is directly to the right from the current position:
, Match literally
(?:\w|$) Match either a single word char, or assert the end of the string
) Close the lookahead
Regex demo | R demo
string <- "active site: His, Glu,region: nucleotide-binding motif A,"
unlist(str_extract_all(string, "[^:,\\n]+:.*?(?=,(?:\\w|$))"))
Output
[1] "active site: His, Glu" "region: nucleotide-binding motif A"

Much longer and not as elegant as #The fourth bird +1,
but it works:
library(stringr)
string2 <- strsplit(string, "([^,]+,[^,]+),", perl = TRUE)[[1]][2]
string1 <- str_replace(string, string2, "")
string <- str_replace_all(c(string1, string2), '\\,$', '')
> string
[1] "active site: His, Glu"
[2] "region: nucleotide-binding motif A"

Related

Find closing parenthesis with regex in r

I have several strings with open and unclosed parenthesis. I managed to remove the opening parenthesis (if there is no closing one), but I do not manage to remove the closing parenthesis if there is no opening one. I want to leave those with matching parenthesis alone
string1 = "This (is solved"
string2 = "This is (fine)"
string3 = "This is the problem)"
This is what I was able to remove the first Problem case with (Opening parenthesis but no opening)
str_remove(data, "[(](?!.*[)])")
But I cannot seem to turn it around. The following grabs all closing parenthesis, but not the one without an oping.
"(?!.*[(])[)]"
Any ideas are appreciated!
If you do not need to handle nested paired (balanced) parentheses, you can use
gsub("(\\([^()]*\\))|[()]", "\\1", string)
See the regex demo. Details:
(\([^()]*\)) - Group 1 (\1 refers to this group value): (, then zero or more chars other than ( and ), and then a ) char
| - or
[()] - a ( or ) char.
See the R demo:
x <- c("This (is solved", "This is (fine)", "This is the problem)")
gsub("(\\([^()]*\\))|[()]", "\\1", x)
# => [1] "This is solved" "This is (fine)" "This is the problem"
If the parentheses can be nested, you can use
gsub("(\\((?:[^()]++|(?1))*\\))|[()]", "\\1", string, perl=TRUE)
See this regex demo. Details:
(\((?:[^()]++|(?1))*\)) - Group 1:
\( - a ( char
(?:[^()\n]++|(?1))* - zero or more sequences of either one or more chars other than ( and ), or the whole Group 1 pattern that is recursed
\) - a ) char
|[()] - or a ( / ) char.

Regex to match a pattern but not two specific cases

I want to match every cases of "-", but not these ones:
[\d]-[A-Z]
[A-Z]-[\d]
I tried this pattern: ((?<![A-Z])-(?![0-9]))|((?<![0-9])-(?![A-Z])) but some results are incorrect like: "RUA VF-32 N"
Can anyone help me?
A simple approach is to use grep with your current logic and inverting the result, and then run another grep to only keep those items that have a hyphen in them:
x <- c("QUADRA 120 - ASA BRANCA","FAZENDA LAGE -RODOVIA RIO VERDE","C-15","99-B","A-A")
grep("-", grep("[A-Z]-\\d|\\d-[A-Z]", x, invert=TRUE, value=TRUE), value=TRUE, fixed=TRUE)
# => [1] "QUADRA 120 - ASA BRANCA" "FAZENDA LAGE -RODOVIA RIO VERDE"
# [3] "A-A"
Here, [A-Z]-\\d|\\d-[A-Z] matches a hyphen either in between an uppercase ASCII etter or a digit or betweena digit and an ASCII uppercase letter. If there is a match, the result is inverted due to invert=TRUE.
See the R demo.
To only match - in all contexts other than in between a letter and a digit, you may use the PCRE regex based on SKIP-FAIL technique like
> grep("(?:\\d-[A-Z]|[A-Z]-\\d)(*SKIP)(*F)|-", x, perl=TRUE)
[1] 1 2
See this regex demo
Details
(?:\d-[A-Z]|[A-Z]-\d) - a non-capturing group that matches either a digit, - and then uppercase ASCII letter, or an uppercase ASCII letter, - and a digit
(*SKIP)(*F) - omit the current match and proceed looking for the next match at the end of the "failed" match
| - or
- - a hyphen.

R regex match things other than known characters

For a text field, I would like to expose those that contain invalid characters. The list of invalid characters is unknown; I only know the list of accepted ones.
For example for French language, the accepted list is
A-z, 1-9, [punc::], space, àéèçè, hyphen, etc.
The list of invalid charactersis unknown, yet I want anything unusual to resurface, for example, I would want
This is an 2-piece à-la-carte dessert to pass when
'Ã this Øs an apple' pumps up as an anomalie
The 'not contain' notion in R does not behave as I would like, for example
grep("[^(abc)]",c("abcdef", "defabc", "apple") )
(those that does not contain 'abc') match all three while
grep("(abc)",c("abcdef", "defabc", "apple") )
behaves correctly and match only the first two. Am I missing something
How can we do that in R ? Also, how can we put hypen together in the list of accepted characters ?
[a-z1-9[:punct:] àâæçéèêëîïôœùûüÿ-]+
The above regex matches any of the following (one or more times). Note that the parameter ignore.case=T used in the code below allows the following to also match uppercase variants of the letters.
a-z Any lowercase ASCII letter
1-9 Any digit in the range from 1 to 9 (excludes 0)
[:punct:] Any punctuation character
The space character
àâæçéèêëîïôœùûüÿ Any valid French character with a diacritic mark
- The hyphen character
See code in use here
x <- c("This is an 2-piece à-la-carte dessert", "Ã this Øs an apple")
gsub("[a-z1-9[:punct:] àâæçéèêëîïôœùûüÿ-]+", "", x, ignore.case=T)
The code above replaces all valid characters with nothing. The result is all invalid characters that exist in the string. The following is the output:
[1] "" "ÃØ"
If by "expose the invalid characters" you mean delete the "accepted" ones, then a regex character class should be helpful. From the ?regex help page we can see that a hyphen is already part of the punctuation character vector;
[:punct:]
Punctuation characters:
! " # $ % & ' ( ) * + , - . / : ; < = > ? # [ \ ] ^ _ ` { | } ~
So the code could be:
x <- 'Ã this Øs an apple'
gsub("[A-z1-9[:punct:] àéèçè]+", "", x)
#[1] "ÃØ"
Note that regex has a predefined, locale-specific "[:alpha:]" named character class that would probably be both safer and more compact than the expression "[A-zàéèçè]" especially since the post from ctwheels suggests that you missed a few. The ?regex page indicates that "[0-9A-Za-z]" might be both locale- and encoding-specific.
If by "expose" you instead meant "identify the postion within the string" then you could use the negation operator "^" within the character class formalism and apply gregexpr:
gregexpr("[^A-z1-9[:punct:] àéèçè]+", x)
[[1]]
[1] 1 8
attr(,"match.length")
[1] 1 1

How to completely remove head and tail white spaces or punctuation characters?

I have string_a, such that
string_a <- " ,A thing, something, . ."
Using regex, how can I just retain "A thing, something"?
I have tried the following and got such output:
sub("[[:punct:]]$|^[[:punct:]]","", trimws(string_a))
[1] "A thing, something, . ."
We can use gsub to match one or more punctuation characters including spaces ([[:punct:] ] +) from the start (^) or | those characters until the end ($) of the string and replace it with blank ("")
gsub("^[[:punct:] ]+|[[:punct:] ]+$", "", string_a)
#[1] "A thing, something"
Note: sub will replace only a single instance
Or as #Cath mentioned [[:punct:] ] can be replaced with \\W

Why does is this end of line (\\b) not recognised as word boundary in stringr/ICU and Perl

Using stringr i tried to detect a € sign at the end of a string as follows:
str_detect("my text €", "€\\b") # FALSE
Why is this not working? It is working in the following cases:
str_detect("my text a", "a\\b") # TRUE - letter instead of €
grepl("€\\b", "2009in €") # TRUE - base R solution
But it also fails in perl mode:
grepl("€\\b", "2009in €", perl=TRUE) # FALSE
So what is wrong about the €\\b-regex? The regex €$ is working in all cases...
When you use base R regex functions without perl=TRUE, TRE regex flavor is used.
It appears that TRE word boundary:
When used after a non-word character matches the end of string position, and
When used before a non-word character matches the start of string position.
See the R tests:
> gsub("\\b\\)", "HERE", ") 2009in )")
[1] "HERE 2009in )"
> gsub("\\)\\b", "HERE", ") 2009in )")
[1] ") 2009in HERE"
>
This is not a common behavior of a word boundary in PCRE and ICU regex flavors where a word boundary before a non-word character only matches when the character is preceded with a word char, excluding the start of string position (and when used after a non-word character requires a word character to appear right after the word boundary):
There are three different positions that qualify as word boundaries:
- Before the first character in the string, if the first character is a word character.
- After the last character in the string, if the last character is a word character.
- Between two characters in the string, where one is a word character and the other is not a word character.
\b
is equivalent to
(?:(?<!\w)(?=\w)|(?<=\w)(?!\w))
which is to say it matches
between a word char and a non-word char,
between a word char and the start of the string, and
between a word char and the end of the string.
€ is a symbol, and symbols aren't word characters.
$ uniprops €
U+20AC <€> \N{EURO SIGN}
\pS \p{Sc}
All Any Assigned Common Zyyy Currency_Symbol Sc Currency_Symbols S Gr_Base Grapheme_Base Graph X_POSIX_Graph GrBase Print X_POSIX_Print Symbol Unicode
If your language supports look-behinds and look-aheads, you could use the following to find a boundary between a space and non-space (treating the start and end as a space).
(?:(?<!\S)(?=\S)|(?<=\S)(?!\S))

Resources