Is there a way to do a negative match using regex sub? - r

Say I have a vector of strings,
g<-c("bunchofstuff>query=true/fun/weird>bunchofstuff", "bunchofstuff>query=animals/octopus/weird>bunchofstuff", "bunchofstuff>query=flowers/sunshine/fun>bunchofstuff", "bunchofstuff>query=fun/true/sunshine>bunchofstuff")
and I essentially want to use sub to erase everything after query=, up to the end of the string, IF query= is not followed by true (ideally in any position). As far as I can tell, there is no useful equivalent of ! (negation) in sub, although there seem to be some workarounds for grepl.
What I want is
newvariable<-c("bunchofstuff>query=true/fun/weird>bunchofstuff", "bunchofstuff>query=", "bunchofstuff>query=", "bunchofstuff>query=fun/true/sunshine>bunchofstuff")

You can do that:
sub('query=\\K(?:(?!true).)+$', '', g, perl=TRUE)
This technique uses a negative lookahead assertion (?!true) that checks, before each character matched by ., that "true" does not follow. The lookahead and the dot sit together in a non-capturing group that is repeated until the end of the string $.
\\K is used to start the matched string after it to preserve the query= substring. (Note that it's only a convenient way to avoid a capture group or to rewrite query= in the replacement string.)
You can be more specific using word-boundaries to be sure that "true" isn't a part of another word:
sub('query=\\K(?:(?!\\btrue\\b).)+$', '', g, perl=TRUE)
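As a quick check, here is a small sketch that runs both patterns on the sample vector from the question (re-typed with its closing parenthesis) and compares the result with the desired output. The names expected, res_plain, res_bounded and the extra string h are made up for this illustration only.

```r
# Sample data from the question.
g <- c("bunchofstuff>query=true/fun/weird>bunchofstuff",
       "bunchofstuff>query=animals/octopus/weird>bunchofstuff",
       "bunchofstuff>query=flowers/sunshine/fun>bunchofstuff",
       "bunchofstuff>query=fun/true/sunshine>bunchofstuff")

# Desired output from the question.
expected <- c("bunchofstuff>query=true/fun/weird>bunchofstuff",
              "bunchofstuff>query=",
              "bunchofstuff>query=",
              "bunchofstuff>query=fun/true/sunshine>bunchofstuff")

# \\K and lookaheads need the PCRE engine, hence perl = TRUE.
res_plain   <- sub('query=\\K(?:(?!true).)+$', '', g, perl = TRUE)
res_bounded <- sub('query=\\K(?:(?!\\btrue\\b).)+$', '', g, perl = TRUE)

identical(res_plain, expected)    # TRUE
identical(res_bounded, expected)  # TRUE

# The two patterns differ when "true" is only part of a longer word.
h <- "stuff>query=untrue/fun"     # hypothetical extra test string
sub('query=\\K(?:(?!true).)+$', '', h, perl = TRUE)          # left unchanged
sub('query=\\K(?:(?!\\btrue\\b).)+$', '', h, perl = TRUE)    # "stuff>query="
```

Strings that do not match are returned unchanged by sub, which is why the elements containing true survive as they are.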

Related

REGEX pattern match in R for Course number

I need to identify course numbers that match the pattern xx.3xxxxxx.
These are some examples of the course numbers.
26.3730004
27.0210000
26.3730009
26.7114001
23.9610071
26.0A34430
23.3670005
26.0B05430
I tried many patterns; one example is the pattern below. It did not get any match.
"[^0-9]{2}\Q.\E3[^0-9]+$"
I tried using grep and grepl. I actually need the code to return indexes.
This code shows my attempt to tag the rows that have matches.
Teacher$virtual[which(grepl("[^0-9]{2}\\Q.\\E3[^0-9]+$",Teacher$CourseNumber))] <- "1"
I need to remove any row from my dataframe whose course number has that pattern, XX.3XXXXXX.
But, my code did not find any match. Can you please help me?
You should use
grepl("^[0-9]{2}\\.3", Teacher$CourseNumber)
Details:
^ - start of a string
[0-9]{2} - two digits
\\. - a dot (note that a regex escape is a literal backslash, but inside a string literal, "...", a single backslash is used to form string escape sequences, hence the backslash must be doubled to obtain the literal backslash char necessary for a regex escape)
3 - a 3 char.
NOTE: If you want to use in-pattern quoting with \Q and \E (in between which all chars are treated literally) you need to use PCRE regex, add perl=TRUE and use
grepl("^[0-9]{2}\\Q.\\E3", Teacher$CourseNumber, perl=TRUE)
Now, the dot is treated as a literal dot, not a . metacharacter that matches any char but a line break char (in a PCRE regex, . does not match line break chars by default).
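To tie this back to the question (returning indexes and dropping the matching rows), here is a minimal sketch. It assumes CourseNumber is stored as character; the Teacher data frame is rebuilt from the sample values in the question, and the "0" default for virtual as well as the names pattern, idx and kept are assumptions made for the example.

```r
# Rebuild a small data frame from the sample course numbers in the question.
Teacher <- data.frame(
  CourseNumber = c("26.3730004", "27.0210000", "26.3730009", "26.7114001",
                   "23.9610071", "26.0A34430", "23.3670005", "26.0B05430"),
  stringsAsFactors = FALSE
)

pattern <- "^[0-9]{2}\\.3"

# grep() returns the indexes of the matching elements.
idx <- grep(pattern, Teacher$CourseNumber)
idx  # 1 3 7

# Tag the matching rows ("0" as the default is an assumption).
Teacher$virtual <- "0"
Teacher$virtual[idx] <- "1"

# Drop the matching rows with a negated logical vector from grepl().
kept <- Teacher[!grepl(pattern, Teacher$CourseNumber), , drop = FALSE]
kept$CourseNumber
# "27.0210000" "26.7114001" "23.9610071" "26.0A34430" "26.0B05430"
```

Using !grepl(...) for the removal is safer than Teacher[-idx, ], because a negative index built from an empty idx would select no rows at all.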
Here, this simple expression would likely cover that:
^[0-9]{2}\.[3].+$
which requires a 3 (written as the character class [3]) right after the literal dot. It would probably work without start and end anchors:
[0-9]{2}\.[3].+
We can add or relax the boundaries if necessary.

Subsetting a value based on partial pattern

I'm trying to subset out using regular expressions, the url: happy_to-learn.com.
As I'm really new to regex, could someone help with my code as to why it does not work?
x <- c("happy_to-learn.com", "His_is-omitted.net")
str_subset(x, "^[a-zA-Z](\\_|\\-)*\\.com$")
I understand that ^[a-zA-Z](\\_|\\-)* this portion here refers to, "Start when you hit a range of alphabets from a to z or A to Z, and it contains either _ or -, if yes, then subset out this portion with 0 or more matches."
However, is it possible to continue from this code by adding the back part of the value I wish to subset? i.e. \\.com$ refers to all values that end with .com.
Is there something like "^[a-zA-Z](\\_|\\-)*...\\.com$" in regex?
We need to specify one or more with + as the _ or - are not just after the first letter.
str_subset(x, "^[a-zA-Z]+(\\_|\\-).*\\.com$")
#[1] "happy_to-learn.com"
Also, the .* refers to zero or more characters as . can be any character until the . and 'com' at the end ($) of the string
Why use an external package? grep can do it too.
grep("^[[:alpha:]_-]+.*\\.com$", x, value = TRUE)
#[1] "happy_to-learn.com"
Explanation.
"^" marks the beginning of the string.
"[:alpha:]" matches any alphabetic character, upper or lower case, in a portable way.
"^[[:alpha:]_-]+" the characters between [] are alternatives, and the class is repeated one or more times: an alphabetic character, the underscore _ or the minus sign -.
"^[[:alpha:]_-]+.*" The above followed by any character repeated zero or more times.
"^[[:alpha:]_-]+.*\\.com$" ending with the string ".com", where the dot is meant literally rather than as a metacharacter and therefore must be escaped.

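To see why the attempt in the question fails and the corrected patterns succeed, here is a small base R sketch. The pattern called original is an equivalent rewriting of the question's regex with a character class [_-] instead of the escaped alternation, and strict is an extra variant added for comparison; all three variable names are hypothetical.

```r
x <- c("happy_to-learn.com", "His_is-omitted.net")

# One letter, then only _ or - characters, then ".com": too restrictive,
# because "happy_to-learn" contains many more letters after the first one.
original  <- "^[a-zA-Z][_-]*\\.com$"

# Pattern from the grep answer above.
corrected <- "^[[:alpha:]_-]+.*\\.com$"

# Stricter variant: only letters, _ and - are allowed before ".com".
strict    <- "^[[:alpha:]_-]+\\.com$"

grepl(original, x)   # FALSE FALSE
grepl(corrected, x)  # TRUE FALSE
grepl(strict, x)     # TRUE FALSE

# value = TRUE returns the matching strings instead of their indexes.
grep(strict, x, value = TRUE)  # "happy_to-learn.com"
```

The strict variant rejects strings such as "abc123.com", which the pattern with .* would accept.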
R - replace last instance of a regex match and everything afterwards

I'm trying to use a regex to replace the last instance of a phrase (and everything after that phrase, which could be any character):
stringi::stri_replace_last_regex("_AB:C-_ABCDEF_ABC:45_ABC:454:", "_ABC.*$", "CBA")
However, I can't seem to get the regex to function properly:
Input: "_AB:C-_ABCDEF_ABC:45_ABC:454:"
Actual output: "_AB:C-CBA"
Desired output: "_AB:C-_ABCDEF_ABC:45_CBA"
I have tried gsub() as well but that hasn't worked.
Any ideas where I'm going wrong?
One solution is:
sub("(.*)_ABC.*", "\\1_CBA", Input)
[1] "_AB:C-_ABCDEF_ABC:45_CBA"
Have a look at what stringi::stri_replace_last_regex does:
Replaces with the given replacement string last substring of the input that matches a regular expression
What does your _ABC.*$ pattern match inside _AB:C-_ABCDEF_ABC:45_ABC:454:? It matches the first _ABC (that is right after C-) and all the text after to the end of the line (.*$ grabs 0+ chars other than line break chars to the end of the line). Hence, you only have 1 match, and it is the last.
Solutions can be many:
1) Capturing all text before the last occurrence of the pattern and insert the captured value with a replacement backreference (this pattern does not have to be anchored at the end of the string with $):
sub("(.*)_ABC.*", "\\1_CBA","_AB:C-_ABCDEF_ABC:45_ABC:454:")
2) Using a tempered greedy token to make sure you only match any char that does not start your pattern up to the end of the string after matching it (this pattern must be anchored at the end of the string with $):
sub("(?s)_ABC(?:(?!_ABC).)*$", "_CBA","_AB:C-_ABCDEF_ABC:45_ABC:454:", perl=TRUE)
Note that this pattern will require perl=TRUE argument to be parsed with a PCRE engine with sub (or you may use stringr::str_replace that is ICU regex library powered and supports lookaheads)
3) A negative lookahead may be used to make sure your pattern does not appear anywhere to the right of your pattern (this pattern does not have to be anchored at the end of the string with $):
sub("(?s)_ABC(?!.*_ABC).*", "_CBA","_AB:C-_ABCDEF_ABC:45_ABC:454:", perl=TRUE)
All three of these lines of code return _AB:C-_ABCDEF_ABC:45_CBA.
Note that (?s) in the PCRE patterns is necessary in case your strings may contain a newline (and . in a PCRE pattern does not match newline chars by default).
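The three approaches can be compared side by side with a short sketch like the one below. The variable names (input, desired, by_capture, by_tempered, by_lookahead) are chosen for this illustration only.

```r
input   <- "_AB:C-_ABCDEF_ABC:45_ABC:454:"
desired <- "_AB:C-_ABCDEF_ABC:45_CBA"

# 1) Greedy capture of everything before the last _ABC (default TRE engine).
by_capture <- sub("(.*)_ABC.*", "\\1_CBA", input)

# 2) Tempered greedy token, anchored at the end of the string (PCRE).
by_tempered <- sub("(?s)_ABC(?:(?!_ABC).)*$", "_CBA", input, perl = TRUE)

# 3) Negative lookahead: no further _ABC anywhere to the right (PCRE).
by_lookahead <- sub("(?s)_ABC(?!.*_ABC).*", "_CBA", input, perl = TRUE)

identical(by_capture, desired)    # TRUE
identical(by_tempered, desired)   # TRUE
identical(by_lookahead, desired)  # TRUE

# A string without _ABC is returned unchanged by all three calls.
sub("(.*)_ABC.*", "\\1_CBA", "no match here")  # "no match here"
```

The capture-group version is the only one that works without perl = TRUE, which may matter if the same pattern has to be reused with functions that do not offer that argument.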
Arguably the safest thing to do is using a negative lookahead to find the last occurrence:
_ABC(?:(?!_ABC).)+$
gsub("_ABC(?:(?!_ABC).)+$", "_CBA","_AB:C-_ABCDEF_ABC:45_ABC:454:", perl=TRUE)
[1] "_AB:C-_ABCDEF_ABC:45_CBA"
Using gsub and back referencing
gsub("(.*)ABC.*$", "\\1CBA","_AB:C-_ABCDEF_ABC:45_ABC:454:")
[1] "_AB:C-_ABCDEF_ABC:45_CBA"

Find hyphenated words with Regex in list R

I have a string of semicolon-separated elements and I want to find out whether a pattern matches any of the elements in the string:
string <- "CPT1B;CPT1B;CPT1B;CHKB-CPT1B;CPT1B;CPT1B;CPT1B;CPT1B"
I want to know which regex to use to match any of these elements. I mean, I want to get TRUE if any of the elements matches, for example, "CPT1B". To do so I use:
grepl(paste("[^;]","CPT1B","[$;]",sep = ""),string)
TRUE
I used "[^;]" and "[$;]" because I want to get TRUE if any of the elements match.
My problem comes when I try to match "CHKB-CPT1B", because if I use the same expression:
grepl(paste("[^;]","CHKB-CPT1B","[$;]",sep = ""),string)
FALSE
I get FALSE, I think it's due to the hyphen in the word, and I'd like to know how to make grepl read the word with the hyphen as one word.
I don't want to use "CHKB\-CPT1B", because this pattern would come from an iterator that could yield both hyphenated and non-hyphenated words. And I would also like not to split the original string by ";"
You need to use alternation groups:
grepl(paste0("(?:^|;)", "CPT1B", "(?:$|;)"),string)
[1] TRUE
The (?:^|;) non-capturing group matches start of string or ; and (?:$|;) matches either the end of string or ;.
You also may use lookarounds with perl=TRUE (i.e. a PCRE pattern):
grepl(paste0("(?<![^;])", "CPT1B", "(?![^;])"),string, perl=TRUE)
Here, the negative lookbehind (?<![^;]) matches any location that is immediately preceded with ; or start of string, and the negative lookahead (?![^;]) requires the next char to be ; or end of string location.
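Since the pattern is said to come from an iterator, both variants can be wrapped in small helpers and applied to several candidates. This is only a sketch: has_element_alt, has_element_look and the candidates vector are hypothetical names, and it assumes the candidates contain no regex metacharacters (a hyphen outside a character class is literal, so it needs no escaping).

```r
string <- "CPT1B;CPT1B;CPT1B;CHKB-CPT1B;CPT1B;CPT1B;CPT1B;CPT1B"

# Hypothetical candidates: two real elements and two partial ones.
candidates <- c("CPT1B", "CHKB-CPT1B", "CHKB", "PT1B")

# Alternation groups: start of string or ";", then the element,
# then end of string or ";".
has_element_alt <- function(element, string) {
  grepl(paste0("(?:^|;)", element, "(?:$|;)"), string)
}

# Lookarounds (PCRE): no non-";" char directly before or after the element.
has_element_look <- function(element, string) {
  grepl(paste0("(?<![^;])", element, "(?![^;])"), string, perl = TRUE)
}

alt  <- vapply(candidates, has_element_alt,  logical(1), string = string)
look <- vapply(candidates, has_element_look, logical(1), string = string)

unname(alt)   # TRUE TRUE FALSE FALSE
unname(look)  # TRUE TRUE FALSE FALSE
```

"CHKB" and "PT1B" return FALSE because they only occur as parts of an element, which also shows that the hyphen itself was never the problem in the original attempt.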

Regex to maintain matched parts

I would like to achieve this result : "raster(B04) + raster(B02) - raster(A10mB03)"
Therefore, I created this regex: B[0-1][0-9]|A[1,2,6]0m/B[0-1][0-9]"
I am now trying to replace all matches of the string "B04 + B02 - A10mB03" with gsub("B[0-1][0-9]]|[A[1,2,6]0mB[0-1][0-9]", "raster()", string)
How could I include the original values B01, B02, A10mB03?
PS: I also tried gsub("B[0-1][0-9]]|[A[1,2,6]0mB[0-1][0-9]", "raster(\\1)", string) but it did not work.
Basically, you need to match some text and re-use it inside a replacement pattern. In base R regex methods, there is no way to do that without a capturing group, i.e. a pair of unescaped parentheses, enclosing the whole regex pattern in this case, and use a \\1 replacement backreference in the replacement pattern.
However, your regex contains some issues: [A[1,2,6] gets parsed as a single character class that matches A, [, 1, ,, 2 or 6 because you placed a [ before A. Also, note that , inside character classes matches a literal comma, and it is not what you expected. Another, similar issue, is with [0-9]] - it matches any ASCII digit with [0-9] and then a ] (the ] char does not have to be escaped in a regex pattern).
So, a potential fix for your expression can look like
gsub("(B[0-1][0-9]|A[126]0mB[0-1][0-9])", "raster(\\1)", string)
Or even just matching 1 or more word chars (considering the sample string you supplied)
gsub("(\\w+)", "raster(\\1)", string)
might do.
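Both suggestions can be checked against the desired result from the question with a short sketch (the variable names desired, explicit and generic are made up for the example):

```r
string  <- "B04 + B02 - A10mB03"
desired <- "raster(B04) + raster(B02) - raster(A10mB03)"

# Explicit pattern: the capturing group wraps the whole alternation,
# so \\1 in the replacement refers to whichever alternative matched.
explicit <- gsub("(B[0-1][0-9]|A[126]0mB[0-1][0-9])", "raster(\\1)", string)

# Generic pattern: wrap every run of word characters.
generic <- gsub("(\\w+)", "raster(\\1)", string)

explicit                      # "raster(B04) + raster(B02) - raster(A10mB03)"
identical(explicit, desired)  # TRUE
identical(generic, desired)   # TRUE
```

The generic pattern wraps any word, so it is only suitable when the string contains nothing but band names and operators.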
