Julia regular expression \w+/g - julia

I'm using replace function with regular expression to find all non alphanumeric characters in string.
new = replace(cont, r"\w+/g" => "")
However, above code do nothing with a string.
If I remove "/g" it works, and it removes all words.

The /g part is not needed because by default replace replaces all matches of a pattern. If you wanted to e.g. replace one match pass count=1 keyword argument to replace.
Now, in most regular expression parsers /g would be an invalid sequence, but in Julia it is accepted, and just matches /g verbatim, see:
julia> match(r"\w+/g", "##abc/gab##")
RegexMatch("abc/g")
julia> replace("##abc/gab##", r"\w+/g" => "")
"##ab##"
as is explained here the /g flag in e.g. Perl can be found at the end of regular expression constructs, but is not a generic regular expression flag, but applies to the operation being performed.
In Julia the allowed flags for regex are listed here and they are i, m, s, and x and are suffixed after the regular expression, e.g. r"a+.*b+.*?d$"ism.

Related

Is there a way to do a negative match using regex sub?

Say I have a vector of strings,
g<-c("bunchofstuff>query=true/fun/weird>bunchofstuff", "bunchofstuff>query=animals/octopus/weird>bunchofstuff", "bunchofstuff>query=flowers/sunshine/fun>bunchofstuff", "
bunchofstuff>query=fun/true/sunshine>bunchofstuff"
and I want to essentially use sub to erase anything after query=, until the end of the string, IF query= is not followed by true (ideally in any position). As far as I can tell, there isn't a useful substitution for ! in sub (seems to be some workarounds in grepl).
What I want is
newvariable<-c("bunchofstuff>query=true/fun/weird>bunchofstuff", "bunchofstuff>query=", "bunchofstuff>query=", "bunchofstuff>query=fun/true/sunshine>bunchofstuff"
You can do that:
sub('query=\\K(?:(?!true).)+$', '', g, perl=TRUE)
This technique uses a negative lookahead assertion (?!true) that checks before each character . if "true" doesn't follow. All is in a non-capturing group repeated until the end of the string $.
\\K is used to start the matched string after it to preserve the query= substring. (Note that it's only a convenient way to avoid a capture group or to rewrite query= in the replacement string.)
You can be more specific using word-boundaries to be sure that "true" isn't a part of another word:
sub('query=\\K(?:(?!\\btrue\\b).)+$', '', g, perl=TRUE)

REGEX pattern match in R for Course number

I need to identify matching course number that have xx.3xxxxxx.
These are some examples of the course numbers.
26.3730004
27.0210000
26.3730009
26.7114001
23.9610071
26.0A34430
23.3670005
26.0B05430
I tried many patterns one example I used is the pattern below. It did not get any match.
"[^0-9]{2}\Q.\E3[^0-9]+$"
I tried using grep and grepl. I actually need the code to return indexes.
This code shows my attempt to tag the rows that have matches.
Teacher$virtual[
which(
grepl("[^0-9]{2}\\Q.\\E3[^0-9]+$",Teacher$CourseNumber))]
<- "1"
I need to remove any row from my dataframe that have the course number with that pattern. XX.3XXXXXX
But, my code did not find any match. Can you please help me?
You should use
grepl("^[0-9]{2}\\.3", Teacher$CourseNumber)
See the regex graph:
Details:
^ - start of a string
[0-9]{2} - two digits
\\. - a dot (note that a regex escape is a literal backslash, but inside a string literal, "...", a single backslash is used to form string escape sequences, hence the backslash must be double to obtain a literal backslash char necessary for a regex escape)
3 - a 3 char.
NOTE: If you want to use in-pattern quoting with \Q and \E (in between which all chars are treated literally) you need to use PCRE regex, add perl=TRUE and use
grepl("^[0-9]{2}\\Q.\\E3", Teacher$CourseNumber, perl=TRUE)
Now, the dot is treated as a literal dot, not a . metacharacter that matches any char but a line break char (in a PCRE regex, . does not match line break chars by default).
Here, this simple expression would likely cover that:
^[0-9]{2}\.[3].+$
which has a [3] boundary right after the .. It would probably work without start and end anchors:
[0-9]{2}\.[3].+
Demo
We can add or reduce the boundaries, if it'd be necessary.

R - replace last instance of a regex match and everything afterwards

I'm trying to use a regex to replace the last instance of a phrase (and everything after that phrase, which could be any character):
stringi::stri_replace_last_regex("_AB:C-_ABCDEF_ABC:45_ABC:454:", "_ABC.*$", "CBA")
However, I can't seem to get the refex to function properly:
Input: "_AB:C-_ABCDEF_ABC:45_ABC:454:"
Actual output: "_AB:C-CBA"
Desired output: "_AB:C-_ABCDEF_ABC:45_CBA"
I have tried gsub() as well but that hasn't worked.
Any ideas where I'm going wrong?
One solution is:
sub("(.*)_ABC.*", "\\1_CBA", Input)
[1] "_AB:C-_ABCDEF_ABC:45_CBA"
Have a look at what stringi::stri_replace_last_regex does:
Replaces with the given replacement string last substring of the input that matches a regular expression
What does your _ABC.*$ pattern match inside _AB:C-_ABCDEF_ABC:45_ABC:454:? It matches the first _ABC (that is right after C-) and all the text after to the end of the line (.*$ grabs 0+ chars other than line break chars to the end of the line). Hence, you only have 1 match, and it is the last.
Solutions can be many:
1) Capturing all text before the last occurrence of the pattern and insert the captured value with a replacement backreference (this pattern does not have to be anchored at the end of the string with $):
sub("(.*)_ABC.*", "\\1_CBA","_AB:C-_ABCDEF_ABC:45_ABC:454:")
2) Using a tempered greedy token to make sure you only match any char that does not start your pattern up to the end of the string after matching it (this pattern must be anchored at the end of the string with $):
sub("(?s)_ABC(?:(?!_ABC).)*$", "_CBA","_AB:C-_ABCDEF_ABC:45_ABC:454:", perl=TRUE)
Note that this pattern will require perl=TRUE argument to be parsed with a PCRE engine with sub (or you may use stringr::str_replace that is ICU regex library powered and supports lookaheads)
3) A negative lookahead may be used to make sure your pattern does not appear anywhere to the right of your pattern (this pattern does not have to be anchored at the end of the string with $):
sub("(?s)_ABC(?!.*_ABC).*", "_CBA","_AB:C-_ABCDEF_ABC:45_ABC:454:", perl=TRUE)
See the R demo online, all these three lines of code returning _AB:C-_ABCDEF_ABC:45_CBA.
Note that (?s) in the PCRE patterns is necessary in case your strings may contain a newline (and . in a PCRE pattern does not match newline chars by default).
Arguably the safest thing to do is using a negative lookahead to find the last occurrence:
_ABC(?:(?!_ABC).)+$
Demo
gsub("_ABC(?:(?!_ABC).)+$", "_CBA","_AB:C-_ABCDEF_ABC:45_ABC:454:", perl=TRUE)
[1] "_AB:C-_ABCDEF_ABC:45_CBA"
Using gsub and back referencing
gsub("(.*)ABC.*$", "\\1CBA","_AB:C-_ABCDEF_ABC:45_ABC:454:")
[1] "_AB:C-_ABCDEF_ABC:45_CBA"

Regex to maintain matched parts

I would like to achieve this result : "raster(B04) + raster(B02) - raster(A10mB03)"
Therefore, I created this regex: B[0-1][0-9]|A[1,2,6]0m/B[0-1][0-9]"
I am now trying to replace all matches of the string "B04 + B02 - A10mB03" with gsub("B[0-1][0-9]]|[A[1,2,6]0mB[0-1][0-9]", "raster()", string)
How could I include the original values B01, B02, A10mB03?
PS: I also tried gsub("B[0-1][0-9]]|[A[1,2,6]0mB[0-1][0-9]", "raster(\\1)", string) but it did not work.
Basically, you need to match some text and re-use it inside a replacement pattern. In base R regex methods, there is no way to do that without a capturing group, i.e. a pair of unescaped parentheses, enclosing the whole regex pattern in this case, and use a \\1 replacement backreference in the replacement pattern.
However, your regex contains some issues: [A[1,2,6] gets parsed as a single character class that matches A, [, 1, ,, 2 or 6 because you placed a [ before A. Also, note that , inside character classes matches a literal comma, and it is not what you expected. Another, similar issue, is with [0-9]] - it matches any ASCII digit with [0-9] and then a ] (the ] char does not have to be escaped in a regex pattern).
So, a potential fix for you expression can look like
gsub("(B[0-1][0-9]|A[126]0mB[0-1][0-9])", "raster(\\1)", string)
Or even just matching 1 or more word chars (considering the sample string you supplied)
gsub("(\\w+)", "raster(\\1)", string)
might do.
See the R demo online.

How to use str_split() in R?

I want to split this string in several substrings:
BAA33520.2|/gene="vpf402",/product="Vpf402"|GI:8272373|AB012574|join{7347:7965,
0:591}
The separator is | (ascii 124).
It works with all other separators but not with this one.
?regex
Two regular expressions may be joined by the infix operator |; the resulting regular expression matches any string matching either subexpression. For example, abba|cde matches either the string abba or the string cde. Note that alternation does not work inside character classes, where | has its literal meaning.
The fundamental building blocks are the regular expressions that match a single character. Most characters, including all letters and digits, are regular expressions that match themselves. Any metacharacter with special meaning may be quoted by preceding it with a backslash. The metacharacters in extended regular expressions are . \ | ( ) [ { ^ $ * + ?, but note that whether these have a special meaning depends on the context.
Thus:
stringr::str_split('BAA33520.2|/gene="vpf402",/product="Vpf402"|GI:8272373|AB012574|join{7347:7965, 0:591}', "\\|")
As #Frank noted, you can do this in base::strsplit() by adding the fixed=TRUE:
strsplit('BAA33520.2|/gene="vpf402",/product="Vpf402"|GI:8272373|AB012574|join{‌​7347:7965, 0:591}',"|", fixed=TRUE)
However, you can also do this with stringr::str_split() by decorating the regular expression for the separator:
stringr::str_split('BAA33520.2|/gene="vpf402",/product="Vpf402"|GI:8272373|AB012574|join{7347:7965, 0:591}',
regex("|", literal=TRUE))
Incidentally, stringr is pretty much just a slightly friendlier wrapper to stringi functions at this point and I highly recommend studying the stringi package as it contains some wonderful gems outside of string spiltting.

Resources