Correcting the regex "\[([a-zA-Z0-9_-]+)]" - common-lisp

The following cl-ppcre regular expression generates an error:
(ppcre:scan-to-strings "\[([a-zA-Z0-9_-]+)]" "[has-instance]")
debugger invoked on a CL-PPCRE:PPCRE-SYNTAX-ERROR in thread
#<THREAD "main thread" RUNNING {10010B0523}>:
Expected end of string. at position 16 in string "[([a-zA-Z0-9_-]+)]"
What I was expecting as return values is:
"[has-instance]"
#("has-instance")
in order to get at the string within the brackets. Can someone provide a regex correction? Thanks.

Inside a Lisp string literal, the escape character (backslash) is only needed to escape itself and double quotes; before any other character it is simply dropped (§2.4.5 Double-Quote):
If a single escape character is seen, the single escape character is discarded, the next character is accumulated, and accumulation continues.
That means that:
"\[([a-zA-Z0-9_-]+)]"
is parsed the same as the following, where backslash is not present:
"[([a-zA-Z0-9_-]+)]"
The PCRE syntax implemented by CL-PPCRE treats the opening square bracket as the start of a character class, which ends at the next closing bracket.
Thus, the above reads the following as a class:
[([a-zA-Z0-9_-]
The corresponding regex tree is:
CL-USER> (ppcre:parse-string "[([a-zA-Z0-9_-]")
(:CHAR-CLASS #\( #\[ (:RANGE #\a #\z) (:RANGE #\A #\Z) (:RANGE #\0 #\9) #\_ #\-)
Note in particular that the opening parenthesis inside it is treated literally. When the parser encounters the closing parenthesis that follows the above fragment, it interprets it as the end of a register group, but no such group was started, hence the error message at position 16 of the string.
To avoid treating the bracket as a character class, it must be preceded by a literal backslash in the string, as you tried to do, but in order to do so you must write two backslash characters:
CL-USER> (ppcre:parse-string "\\[([a-zA-Z0-9_-]+)]")
(:SEQUENCE #\[
(:REGISTER
(:GREEDY-REPETITION 1 NIL
(:CHAR-CLASS (:RANGE #\a #\z) (:RANGE #\A #\Z) (:RANGE #\0 #\9) #\_ #\-)))
#\])
The closing square bracket needs no backslash.
I encourage you to write regular expressions in Lisp using the tree form, with :regex terms when it improves clarity: it avoids having to deal with the kind of problems that escaping brings. For example:
CL-USER> (ppcre:scan-to-strings
'(:sequence "[" (:register (:regex "[a-zA-Z0-9_-]+")) "]")
"[has-instance]")
"[has-instance]"
#("has-instance")
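The two layers of escaping (string reader first, regex parser second) are not specific to Lisp. As a rough illustration in Python rather than Common Lisp, the sketch below shows the pattern the regex engine must receive, and what happens when the backslash never reaches it. One caveat: Python's string reader keeps the backslash of an unknown escape, where the Lisp reader drops it, so the lost-backslash case is written out by hand here. The helper name compiles is made up for this example.

```python
import re

# The regex engine must receive a backslash followed by "[".
escaped = "\\[([a-zA-Z0-9_-]+)]"   # doubled backslash in an ordinary literal
raw = r"\[([a-zA-Z0-9_-]+)]"       # raw literal, no doubling needed
assert escaped == raw


def compiles(pattern):
    """Return True if the regex engine accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


# What the engine sees when the backslash is lost before it gets there:
# "[([a-zA-Z0-9_-]" is read as one character class, so ")" has no opener.
unescaped = "[([a-zA-Z0-9_-]+)]"

match = re.search(raw, "[has-instance]")
print(match.group(0), match.group(1))
print(compiles(raw), compiles(unescaped))
```

Whichever language is used, the useful check is to print or inspect the pattern string after the reader has processed it, since that is what the regex parser actually sees.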

Double escape the square brackets.
You forgot to (double) escape the closing bracket, too.
(cl-ppcre:scan-to-strings "\\[([a-zA-Z0-9_-]+)\\]" "[has-instance]")
;; "[has-instance]"
;; #("has-instance")
For those who are new to Common Lisp, you can load cl-ppcre using Quicklisp:
(load "~/quicklisp/setup.lisp") ;; adjust path to where you installed your quicklisp
(ql:quickload :cl-ppcre)

Related

Julia regular expression \w+/g

I'm using replace function with regular expression to find all non alphanumeric characters in string.
new = replace(cont, r"\w+/g" => "")
However, the above code does nothing to the string.
If I remove "/g" it works, and it removes all words.
The /g part is not needed because by default replace replaces all matches of a pattern. If you wanted to replace only one match, for example, you would pass the count=1 keyword argument to replace.
Now, in most regular expression parsers /g would be an invalid sequence, but in Julia it is accepted, and just matches /g verbatim, see:
julia> match(r"\w+/g", "##abc/gab##")
RegexMatch("abc/g")
julia> replace("##abc/gab##", r"\w+/g" => "")
"##ab##"
The /g flag in e.g. Perl can be found at the end of regular expression constructs, but it is not a generic regular expression flag; it applies to the operation being performed.
In Julia the allowed flags for regex are i, m, s, and x, and they are suffixed after the regular expression, e.g. r"a+.*b+.*?d$"ism.
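The same behaviour can be seen outside Julia. The sketch below uses Python's re module purely as an illustration (it is not Julia code): substitution is global by default, a count argument limits it, and a trailing "/g" inside the pattern is just two literal characters.

```python
import re

text = "##abc/gab##"

# "/g" is not a flag here either: it is matched literally after the word.
literal_g = re.sub(r"\w+/g", "", text)

# Without "/g", every run of word characters is removed by default.
all_words = re.sub(r"\w+", "", text)

# count=1 limits the substitution to the first match.
first_only = re.sub(r"\w+", "", text, count=1)

print(literal_g)
print(all_words)
print(first_only)
```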

Distinguish between transpose and command string in Julia lexer

For my thesis I am implementing a parser/lexer for Julia, however some areas are being a bit of a problem.
For background Julia has a special token that gives the transpose (`), also there is 'command string' that uses this same token to wrap the string (`command`). The problem I am having is that I can't seem to get a regex that will match properly.
i.e.
this should match for a transpose:
a`
as well as
a` b`
and
a`
b`
and this should match the command string
`a`
and also:
` a
b `
The issue I'm having is one of two things: either, when there are two transposes in a file, the lexer matches them as a command string; or, when there is a newline in a command string, the parser fails because both backticks are seen as transposes. To me these two requirements seem mutually exclusive.
The regexes in the order in which they are in the lexer are:
option 1:
COMMAND
: '`' (ESC|.)*? '`'
;
TRANSPOSE
: '\'' | '`'
;
option 2:
COMMAND
: '`' ( '\\' | ~[\\\r\n\f] )* '`'
;
TRANSPOSE
: '\'' | '`'
;
As has been noted in the comments, the transpose operator in Julia is actually ', rather than `. What has not been noted yet is that there is at least one critical difference between how ' is used and how ` is used that makes your job a lot easier. Specifically:
Unlike `, which may be used to quote command strings of any length, ' is only ever used to quote characters. Consequently, the only valid uses of ' as a quotation are of single characters (e.g. 'a') or one of the special ANSI escape sequences beginning with \ such as '\n' for newline (the full list being to my knowledge \a, \b, \f, \n, \r, \t, \v, \', \", and \\).
Consequently, the 's in a sequence like [a' b'] can only possibly be interpreted as transposes, since ' b' is not a valid Char.
While juxtaposition can be taken to mean multiplication in Julia, and multiplication can in turn be used to concatenate strings in Julia (long story -- string concatenation is the associative, noncommutative binary operation of a free monoid and thus analogous to multiplication), juxtaposition is not currently allowed as a way to multiply strings or characters.
Consequently, a sequence like a'b' can only be interpreted as a' * b' and not a * 'b'.
Combining these two more broadly, unless I am missing some edge case, it appears that a new ' following any character other than whitespace, parentheses, or a valid infix operator, is always parsed as transpose, rather than the opening quote of a character literal.
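The rule in the last paragraph can be turned into a lexer decision based on the character just before the quote. The following is a minimal sketch in Python, not a Julia lexer: the function name classify_quotes and the exact set of "operand-ending" characters are assumptions made for illustration, and comments, string literals, and other contexts that a real lexer must handle are ignored.

```python
def classify_quotes(src):
    """Label each quote that starts a token as 'transpose' or 'char'."""
    closers = "_)]}'"   # assumption: these characters can end an operand
    kinds = []
    i = 0
    while i < len(src):
        if src[i] != "'":
            i += 1
            continue
        prev = src[i - 1] if i > 0 else " "
        if prev.isalnum() or prev in closers:
            # Quote directly after an operand: postfix transpose.
            kinds.append("transpose")
            i += 1
            continue
        # Otherwise treat it as the opening quote of a character literal:
        # skip one (possibly escaped) character, then find the closing quote.
        body = i + 2 if src[i + 1:i + 2] == "\\" else i + 1
        close = src.find("'", body + 1)
        kinds.append("char")
        if close == -1:
            break
        i = close + 1
    return kinds


print(classify_quotes("a' b'"))
print(classify_quotes("x = 'a'"))
print(classify_quotes("f('\\n')'"))
```

In a generated lexer the same idea is usually expressed as a predicate or lexer mode keyed on the previous token, rather than as a single regular expression.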

R - replace last instance of a regex match and everything afterwards

I'm trying to use a regex to replace the last instance of a phrase (and everything after that phrase, which could be any character):
stringi::stri_replace_last_regex("_AB:C-_ABCDEF_ABC:45_ABC:454:", "_ABC.*$", "CBA")
However, I can't seem to get the regex to function properly:
Input: "_AB:C-_ABCDEF_ABC:45_ABC:454:"
Actual output: "_AB:C-CBA"
Desired output: "_AB:C-_ABCDEF_ABC:45_CBA"
I have tried gsub() as well but that hasn't worked.
Any ideas where I'm going wrong?
One solution is:
sub("(.*)_ABC.*", "\\1_CBA", Input)
[1] "_AB:C-_ABCDEF_ABC:45_CBA"
Have a look at what stringi::stri_replace_last_regex does:
Replaces with the given replacement string last substring of the input that matches a regular expression
What does your _ABC.*$ pattern match inside _AB:C-_ABCDEF_ABC:45_ABC:454:? It matches the first _ABC (that is right after C-) and all the text after to the end of the line (.*$ grabs 0+ chars other than line break chars to the end of the line). Hence, you only have 1 match, and it is the last.
Solutions can be many:
1) Capturing all text before the last occurrence of the pattern and insert the captured value with a replacement backreference (this pattern does not have to be anchored at the end of the string with $):
sub("(.*)_ABC.*", "\\1_CBA","_AB:C-_ABCDEF_ABC:45_ABC:454:")
2) Using a tempered greedy token to make sure you only match any char that does not start your pattern up to the end of the string after matching it (this pattern must be anchored at the end of the string with $):
sub("(?s)_ABC(?:(?!_ABC).)*$", "_CBA","_AB:C-_ABCDEF_ABC:45_ABC:454:", perl=TRUE)
Note that this pattern requires the perl=TRUE argument so that sub parses it with the PCRE engine (or you may use stringr::str_replace, which is powered by the ICU regex library and supports lookaheads).
3) A negative lookahead may be used to make sure your pattern does not appear anywhere to the right of your pattern (this pattern does not have to be anchored at the end of the string with $):
sub("(?s)_ABC(?!.*_ABC).*", "_CBA","_AB:C-_ABCDEF_ABC:45_ABC:454:", perl=TRUE)
All three of these lines of code return _AB:C-_ABCDEF_ABC:45_CBA.
Note that (?s) in the PCRE patterns is necessary in case your strings may contain a newline (and . in a PCRE pattern does not match newline chars by default).
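For comparison, here is the same set of ideas sketched with Python's re module instead of R (an illustration only; the variable names are made up). It first shows that the original pattern has just one match, then applies the three approaches. re.S plays the role of (?s).

```python
import re

s = "_AB:C-_ABCDEF_ABC:45_ABC:454:"

# The original pattern has a single match, starting at the first "_ABC".
naive_matches = re.findall(r"_ABC.*$", s)

# 1) Greedy capture of everything before the last "_ABC".
by_capture = re.sub(r"(.*)_ABC.*", r"\1_CBA", s, count=1, flags=re.S)

# 2) Tempered greedy token, anchored at the end of the string.
by_tempered = re.sub(r"_ABC(?:(?!_ABC).)*$", "_CBA", s, count=1, flags=re.S)

# 3) Negative lookahead: no further "_ABC" anywhere to the right.
by_lookahead = re.sub(r"_ABC(?!.*_ABC).*", "_CBA", s, count=1, flags=re.S)

print(naive_matches)
print(by_capture, by_tempered, by_lookahead)
```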
Arguably the safest thing to do is to use a negative lookahead to find the last occurrence:
_ABC(?:(?!_ABC).)+$
gsub("_ABC(?:(?!_ABC).)+$", "_CBA","_AB:C-_ABCDEF_ABC:45_ABC:454:", perl=TRUE)
[1] "_AB:C-_ABCDEF_ABC:45_CBA"
Using gsub and back referencing
gsub("(.*)ABC.*$", "\\1CBA","_AB:C-_ABCDEF_ABC:45_ABC:454:")
[1] "_AB:C-_ABCDEF_ABC:45_CBA"

Regex: How to extract text from last parenthesis

What is a correct regular expression to extract the string "(procedure)" -or in general text from inside the parenthesis - from the strings below
input string examples are
Positron emission tomography using flutemetamol (18F) with computed
tomography of brain (procedure)
another example
Urinary tract infection prophylaxis (procedure)
Possible approaches are:
Go to end of the text, and look for first opening parenthesis and take subset from that position to the end of the text
from beginning of text, identify last '(' char and do that position to end as substring
Other strings can be (different "tag" is extracted)
[1] "Xanthoma of eyelid (disorder)" "Ventricular tachyarrhythmia (disorder)"
[3] "Abnormal urine odor (finding)" "Coloboma of iris (disorder)"
[5] "Macroencephaly (disorder)" "Right main coronary artery thrombosis (disorder)"
(general regex is sought) (or a solution in R is even better)
If it is the last part of the string then this regex will do it:
/\(([^()]*)\)$/
Explanation: Look for an open ( and match everything after it that isn't ( or ), followed by a ) at the end of the string.
https://regex101.com/r/cEsQtf/1
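As a quick check of that pattern, here is a small sketch using Python's re module (the question asks about R, so this is only an illustration; the names TRAILING_PARENS and trailing_tag are made up). The first label is written on one line here.

```python
import re

labels = [
    "Positron emission tomography using flutemetamol (18F) "
    "with computed tomography of brain (procedure)",
    "Urinary tract infection prophylaxis (procedure)",
    "Abnormal urine odor (finding)",
]

# "(" then any run of non-parenthesis characters then ")" at end of string.
TRAILING_PARENS = re.compile(r"\(([^()]*)\)$")


def trailing_tag(text):
    """Return the text inside the final parentheses, or None if absent."""
    m = TRAILING_PARENS.search(text)
    return m.group(1) if m else None


tags = [trailing_tag(t) for t in labels]
print(tags)
```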
sub can do that with the right regex
Text = c("Positron emission tomography using flutemetamol (18F)
with computed tomography of brain (procedure)",
"Urinary tract infection prophylaxis (procedure)",
"Xanthoma of eyelid (disorder)",
"Ventricular tachyarrhythmia (disorder)",
"Abnormal urine odor (finding)",
"Coloboma of iris (disorder)",
"Macroencephaly (disorder)",
"Right main coronary artery thrombosis (disorder)")
sub(".*\\((.*)\\).*", "\\1", Text)
[1] "procedure" "procedure" "disorder" "disorder" "finding" "disorder"
[7] "disorder" "disorder"
Addendum: Detailed explanation of the regex
The question asks to find the content of the final set of parentheses in the strings. This expression is slightly confusing because it includes two different uses of parentheses: one is to represent parentheses in the string being processed, and the other is to set up a "capturing group", the way that we specify what part should be returned by the expression. The expression is made up of five basic units:
1. Initial .* - matches everything up to the final open parenthesis.
Note that this is relying on "greedy matching"
2. \\( ... \\) - matches the final set of parentheses.
Because ( by itself means something else, we need to "escape" the
parentheses by preceding them with \. That is, we want the regular
expression to say \( ... \). However, because of the way R interprets
strings, if we just typed \( and \), R would try to read \( as a string
escape sequence and reject it as unrecognized. So we escape the backslash.
R will interpret \\( ... \\) as \( ... \) meaning the literal
characters ( & ).
3. ( ... ) Inside the pair in part 2
This is making use of the special meaning of parentheses. When we
enclose an expression in parentheses, whatever value is inside them
will be stored in a variable for later use. That variable is called
\1, which is what was used in the substitution pattern. Again, if
we just wrote \1, R would interpret it as if we were trying to escape
the 1. Writing \\1 is interpreted as the character \ followed by 1,
i.e. \1.
4. Central .* Inside the pair in part 3
This is what we are looking for, all characters inside the parentheses.
5. Final .*
This is in the expression to match any characters that may follow the
final set of parentheses.
The sub function will use this to replace the matched pattern (in this case, all characters in the string) with the substitution pattern \1 i.e. the contents of the variable containing whatever was in the first (in our case only) capturing group - the stuff inside the final parentheses.
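The reliance on greedy matching in unit 1 is easy to see by comparing it with a lazy quantifier. The sketch below uses Python's re module as a stand-in for R's sub (an assumption made for illustration only): with a greedy leading .* the final pair of parentheses is captured, while a lazy .*? lets the group start at the first opening parenthesis.

```python
import re

text = ("Positron emission tomography using flutemetamol (18F) "
        "with computed tomography of brain (procedure)")

# Greedy leading ".*": backtracks only as far as the LAST "(".
greedy = re.sub(r".*\((.*)\).*", r"\1", text)

# Lazy leading ".*?": stops at the FIRST "(", so the group spans too much.
lazy = re.sub(r".*?\((.*)\).*", r"\1", text)

print(greedy)
print(lazy)
```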
You can actually use the following to extract the text inside nested parentheses at the end of string:
x <- c("FELON IN POSSESSION OF AMMUNITION (ACTUAL POSSESSION) (79023)",
"FAIL TO DISPLAY REGISTRATION - POSSESSION REQUIRED (320.0605(1))")
sub(".*(\\(((?:[^()]++|(?1))*)\\))$", "\\2", x, perl=TRUE)
Details:
.* - any zero or more chars other than line break chars, as many as possible
(\(((?:[^()]++|(?1))*)\)) - Capturing group 1 (necessary for recursion to take place):
\( - a ( char
((?:[^()]++|(?1))*) - Capturing group 2 (our value): zero or more occurrences of any one or more chars other than ( and ), or the whole Group 1 pattern
\) - a ) char
$ - end of string.
The whole string is thus, when matched, replaced with the value of Group 2. If there is no match, the string remains what it was.
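Python's built-in re module does not support the (?1) recursion used above, so the sketch below swaps in a different technique: a plain backwards scan with a depth counter. It is an illustration of the same goal (the content of the final, possibly nested, parenthesized group at the end of the string), not a translation of the regex. The function name last_balanced_group is made up, and unlike the sub call it returns None when there is no match rather than the original string.

```python
def last_balanced_group(text):
    """Return the content of the balanced (...) group that ends the string."""
    if not text.endswith(")"):
        return None
    depth = 0
    # Walk backwards, counting ")" as opening a level and "(" as closing it.
    for i in range(len(text) - 1, -1, -1):
        if text[i] == ")":
            depth += 1
        elif text[i] == "(":
            depth -= 1
            if depth == 0:
                return text[i + 1:-1]
    return None  # unbalanced parentheses


x = [
    "FELON IN POSSESSION OF AMMUNITION (ACTUAL POSSESSION) (79023)",
    "FAIL TO DISPLAY REGISTRATION - POSSESSION REQUIRED (320.0605(1))",
]
results = [last_balanced_group(s) for s in x]
print(results)
```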

Escaping backslash (\) in string or paths in R

Windows copies path with backslash \, which R does not accept. So, I wanted to write a function which would convert \ to /. For example:
chartr0 <- function(foo) chartr('\','\\/',foo)
Then use chartr0 as...
source(chartr0('E:\RStuff\test.r'))
But chartr0 is not working. I guess, I am unable to escape /. I guess escaping / may be important in many other occasions.
Also, is it possible to avoid calling chartr0 every time, and instead convert all paths automatically, either by creating an environment in R which calls chartr0 or through some kind of temporary setting such as options?
From R 4.0.0 you can use r"(...)" to write a path as raw string constant, which avoids the need for escaping:
r"(E:\RStuff\test.r)"
# [1] "E:\\RStuff\\test.r"
There is a new syntax for specifying raw character constants similar to the one used in C++: r"(...)" with ... any character sequence not containing the sequence )". This makes it easier to write strings that contain backslashes or both single and double quotes. For more details see ?Quotes.
Your fundamental problem is that R will signal an error condition as soon as it sees a single back-slash before any character other than a few lower-case letters, backslashes themselves, quotes or some conventions for entering octal, hex or Unicode sequences. That is because the interpreter sees the back-slash as a message to "escape" the usual translation of characters and do something else. If you want a single back-slash in your character element you need to type 2 backslashes. That will create one backslash:
nchar("\\")
#[1] 1
The "Character vectors" section of _Intro_to_R_ says:
"Character strings are entered using either matching double (") or single (') quotes, but are printed using double quotes (or sometimes without quotes). They use C-style escape sequences, using \ as the escape character, so \\ is entered and printed as \\, and inside double quotes " is entered as \". Other useful escape sequences are \n, newline, \t, tab and \b, backspace—see ?Quotes for a full list."
?Quotes
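The same distinction between what is typed and what is stored exists in other languages with C-style escapes. As an illustration in Python rather than R (the helper name to_forward_slashes is made up), the sketch below shows that a doubled backslash is a single character, that a raw literal avoids the doubling, and that the conversion itself is a one-line replacement.

```python
# Raw literal: backslashes are kept exactly as typed.
windows_path = r"E:\RStuff\test.r"

# Ordinary literal: each backslash has to be doubled.
same_path = "E:\\RStuff\\test.r"
assert windows_path == same_path

# Two typed backslashes create a string holding one backslash.
one_backslash = "\\"


def to_forward_slashes(path):
    """Replace every backslash in the path with a forward slash."""
    return path.replace("\\", "/")


converted = to_forward_slashes(windows_path)
print(len(one_backslash), converted)
```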
chartr0 <- function(foo) chartr('\\','/',foo)
chartr0('E:\\RStuff\\test.r')
You cannot write E:\Rxxxx, because R believes R is escaped.
The problem is that every single forward slash and backslash in your code is escaped incorrectly, resulting in either an invalid string or the wrong string being used. You need to read up on which characters need to be escaped and how. Take a look at the list of escape sequences in the link below. Anything not listed there (such as the forward slash) is treated literally and does not require any escaping.
http://cran.r-project.org/doc/manuals/R-lang.html#Literal-constants

Resources