For my thesis I am implementing a parser/lexer for Julia, but some areas are proving problematic.
For background, Julia has a special token that gives the transpose (`), and there is also a 'command string' that uses this same token to wrap the string (`command`). The problem is that I can't seem to write a regex that matches properly.
i.e.
this should match for a transpose:
a`
as well as
a` b`
and
a`
b`
and this should match the command string
`a`
and also:
` a
b `
The issue is that either the lexer matches a command string when there are two transposes in a file, or, when there is a newline inside a command string, the parser fails because both backticks are seen as transposes. To me these two requirements seem mutually exclusive.
The regexes in the order in which they are in the lexer are:
option 1:
COMMAND
: '`' (ESC|.)*? '`'
;
TRANSPOSE
: '\'' | '`'
;
option 2:
COMMAND
: '`' ( '\\' | ~[\\\r\n\f] )* '`'
;
TRANSPOSE
: '\'' | '`'
;
As has been noted in the comments, the transpose operator in Julia is actually ', rather than `. What has not been noted yet is that there is at least one critical difference between how ' is used and how ` is used that makes your job a lot easier. Specifically:
Unlike `, which may be used to quote command strings of any length, ' is only ever used to quote characters. Consequently, the only valid uses of ' as a quotation are of single characters (e.g. 'a') or one of the special ANSI escape sequences beginning with \ such as '\n' for newline (the full list being to my knowledge \a, \b, \f, \n, \r, \t, \v, \', \", and \\).
Consequently, the 's in a sequence like [a' b'] can only be interpreted as transposes, since ' b' is not a valid Char.
While juxtaposition can be taken to mean multiplication in Julia, and multiplication can in turn be used to concatenate strings in Julia (long story -- string concatenation is the associative, noncommutative binary operation of a free monoid and thus analogous to multiplication), juxtaposition is not currently allowed as a way to multiply strings or characters.
Consequently, a sequence like a'b' can only be interpreted as a' * b' and not a * 'b'.
Combining these two more broadly, unless I am missing some edge case, it appears that a new ' following any character other than whitespace, parentheses, or a valid infix operator, is always parsed as transpose, rather than the opening quote of a character literal.
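The rule of thumb above can be sketched as a small context check. This is only an illustration in Python, not an ANTLR grammar and not how Julia's own parser does it: the function name classify_quotes is made up, the set of "closing" characters that may precede a transpose is my assumption, and only 'a'-style and '\n'-style character literals are recognised.

```python
def classify_quotes(src):
    """Label each ' in src as ("transpose") or the start of a ("char") literal.

    Assumption: a ' directly after an identifier character, a closing
    bracket, or another transpose is a transpose; anything else opens a
    character literal of the form 'a' or '\\n'.
    """
    closers = set(")]}'")
    labels = []
    i = 0
    while i < len(src):
        if src[i] != "'":
            i += 1
            continue
        prev = src[i - 1] if i > 0 else ""
        if prev and (prev.isalnum() or prev == "_" or prev in closers):
            labels.append((i, "transpose"))
            i += 1
        else:
            # Skip the whole literal: 3 characters for 'a', 4 for '\n'.
            width = 4 if src[i + 1:i + 2] == "\\" else 3
            labels.append((i, "char"))
            i += width
    return labels


print(classify_quotes("a' b'"))
print(classify_quotes("f('a')'"))
```

In a real lexer the same decision is usually made from the previous token rather than the previous character, but the idea is the same: the quote is disambiguated by what comes before it, not by a single regex.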
I am journeying through the Ada 2012 RM and would like to see if there is a hole in my understanding or a hole in the RM. Assuming that
put_line ("-- this is a not a comment");
is legal code, how can I deduce its legality from the RM, since section 2.7 states that "a comment starts with two adjacent hyphens and extends up to the end of the line.", while section 2.6 states "a string_literal is formed by a sequence of graphic characters (possibly none) enclosed between two quotation marks used as string brackets." It seems like there is tension between the two sections and that 2.7 would win, but that is apparently not the case.
To get a clearer understanding here, you need to have a look at section 2.2 in the RM.
2.2 (1), which states;
The text of each compilation is a sequence of separate lexical elements. Each lexical element is formed from a sequence of characters, and is either a delimiter, an identifier, a reserved word, a numeric_literal, a character_literal, a string_literal, or a comment. The meaning of a program depends only on the particular sequences of lexical elements that form its compilations, excluding comments.
And 2.2 (3/2) which states:
"[In some cases an explicit separator is required to separate adjacent lexical elements.] A separator is any of a separator_space, a format_effector, or the end of a line, as follows:
A separator_space is a separator except within a comment, a string_literal, or a character_literal.
The character whose code point is 16#09# (CHARACTER TABULATION) is a separator except within a comment.
The end of a line is always a separator.
One or more separators are allowed between any two adjacent lexical elements, before the first of each compilation, or after the last."
and
A delimiter is either one of the following special characters:
& ' ( ) * + , - . / : ; < = > |
or one of the following compound delimiters each composed of two adjacent special characters
=> .. ** := /= >= <= << >> <>
Each of the special characters listed for single character delimiters is a single delimiter except if this character is used as a character of a compound delimiter, or as a character of a comment, string_literal, character_literal, or numeric_literal.
So, once you filter out the white-space of a program text and break it down into a sequence of lexical elements, a lexical element corresponding to a string literal begins with a double quote character, and a lexical element corresponding to a comment begins with --.
These are clearly different syntax items, and do not conflict with each other.
This also explains why;
X := A - -1
+ B;
gives a different result than;
X := A --1
+ B;
The space between the dashes makes each minus a separate lexical element, so the first case lexes as two - delimiters followed by the numeric literal 1, while in the second case --1 starts a comment that runs to the end of the line.
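To make the "sequence of lexical elements" idea concrete, here is a minimal sketch of a left-to-right scanner, written in Python rather than Ada purely for illustration. The token classes are a tiny, hypothetical subset of what the RM defines (no character literals, based numbers, or most delimiters). The point is only that the scanner decides what a lexical element is from the character where that element starts, so a -- inside an already-started string literal is never considered as a comment.

```python
import re

# Each alternative is tried at the current position only.
TOKEN = re.compile(r'''
    (?P<comment>--[^\n]*)
  | (?P<string>"(?:[^"\n]|"")*")
  | (?P<number>\d+)
  | (?P<identifier>[A-Za-z][A-Za-z0-9_]*)
  | (?P<delimiter>:=|[-+;()])
  | (?P<space>\s+)
''', re.VERBOSE)


def lex(text):
    """Split text into (kind, lexeme) pairs, dropping separators."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character at {pos}: {text[pos]!r}")
        if m.lastgroup != "space":
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens


print(lex('put_line ("-- this is a not a comment");'))
print(lex("X := A - -1"))
print(lex("X := A --1\n+ B;"))
```

The first call yields a single string token for the quoted text, the second yields two separate - delimiters, and the third yields a comment token --1.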
I am trying to use a grep formula to search for at least one of the following terms in quotations in the code below in df$AllPrograms.
grep("Service & Product Provider (Partner;ACT)" | "Buildings (Prospect;INA)", df$AllPrograms)
This isn't working, and I suspect it is because grep is interpreting the &, ; and () as operators rather than as characters.
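Use a double backslash "\\" to escape these characters. The backslash is the escape character in extended regex, but inside an R string the backslash itself has to be escaped as well, hence the doubling. (Strictly speaking, only the parentheses are regex metacharacters here; & and ; already match themselves.)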
Also, in your example code you have incorrectly specified the OR statement. Try:
grep("Service \\& Product Provider \\(Partner\\;ACT\\)|Buildings \\(Prospect\\;INA\\)", df$AllPrograms)
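The same two ideas (escape the metacharacters, join the alternatives with a single | inside one pattern) carry over to other regex engines. A minimal sketch in Python, for illustration only; the list programs is made-up sample data standing in for df$AllPrograms:

```python
import re

# Hypothetical sample data standing in for df$AllPrograms.
programs = [
    "Service & Product Provider (Partner;ACT)",
    "Buildings (Prospect;INA)",
    "Something else (Other;XYZ)",
]

terms = [
    "Service & Product Provider (Partner;ACT)",
    "Buildings (Prospect;INA)",
]

# re.escape backslash-escapes every metacharacter, so each term is
# matched literally; "|" then joins the terms into one alternation.
pattern = re.compile("|".join(re.escape(t) for t in terms))

matches = [i for i, p in enumerate(programs) if pattern.search(p)]
print(matches)
```

Building the pattern from a list of terms avoids hand-escaping each one, which is where mistakes tend to creep in.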
If there are many other patterns that you'd like to check for, take a look at this link here:
grep using a character vector with multiple patterns
What is a correct regular expression to extract the string "(procedure)" -or in general text from inside the parenthesis - from the strings below
input string examples are
Positron emission tomography using flutemetamol (18F) with computed
tomography of brain (procedure)
another example
Urinary tract infection prophylaxis (procedure)
Possible approaches are:
Go to end of the text, and look for first opening parenthesis and take subset from that position to the end of the text
from beginning of text, identify last '(' char and do that position to end as substring
Other strings can be (different "tag" is extracted)
[1] "Xanthoma of eyelid (disorder)" "Ventricular tachyarrhythmia (disorder)"
[3] "Abnormal urine odor (finding)" "Coloboma of iris (disorder)"
[5] "Macroencephaly (disorder)" "Right main coronary artery thrombosis (disorder)"
(general regex is sought) (or a solution in R is even better)
If it is the last part of the string then this regex will do it:
/\(([^()]*)\)$/
Explanation: Look for an opening (, match everything after it that isn't ( or ), and then require a closing ) at the end of the string.
https://regex101.com/r/cEsQtf/1
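As a quick check of that pattern outside regex101, here is a minimal sketch in Python (used only for illustration; the helper name last_tag is made up). The pattern is the one above without the surrounding slashes.

```python
import re

# Same pattern as above: a (...) group with no nested parentheses,
# anchored to the end of the string.
LAST_TAG = re.compile(r"\(([^()]*)\)$")


def last_tag(text):
    """Return the text inside the final (...) of text, or None."""
    m = LAST_TAG.search(text)
    return m.group(1) if m else None


print(last_tag("Positron emission tomography using flutemetamol (18F) "
               "with computed tomography of brain (procedure)"))
print(last_tag("Urinary tract infection prophylaxis (procedure)"))
```

Because of the $ anchor, the earlier (18F) group is skipped and only the trailing tag is captured.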
sub can do that with the right regex
Text = c("Positron emission tomography using flutemetamol (18F)
with computed tomography of brain (procedure)",
"Urinary tract infection prophylaxis (procedure)",
"Xanthoma of eyelid (disorder)",
"Ventricular tachyarrhythmia (disorder)",
"Abnormal urine odor (finding)",
"Coloboma of iris (disorder)",
"Macroencephaly (disorder)",
"Right main coronary artery thrombosis (disorder)")
sub(".*\\((.*)\\).*", "\\1", Text)
[1] "procedure" "procedure" "disorder" "disorder" "finding" "disorder"
[7] "disorder" "disorder"
Addendum: Detailed explanation of the regex
The question asks to find the content of the final set of parentheses in the strings. This expression is slightly confusing because it includes two different uses of parentheses. One is to represent parentheses in the string being processed and the other is to set up a "capturing group", the way that we specify what part should be returned by the expression. The expression is made up of five basic units:
1. Initial .* - matches everything up to the final open parenthesis.
Note that this is relying on "greedy matching"
2. \\( ... \\) - matches the final set of parentheses.
Because ( by itself means something else, we need to "escape" the
parentheses by preceding them with \. That is we want the regular
expression to say \( ... \). However, the way R interprets strings,
if we just typed \( and \), R would interpret the \ as escaping the (
and so interpret this as just ( ... ). So we escape the backslash.
R will interpret \\( ... \\) as \( ... \) meaning the literal
characters ( & ).
3. ( ... ) Inside the pair in part 2
This is making use of the special meaning of parentheses. When we
enclose an expression in parentheses, whatever value is inside them
will be stored in a variable for later use. That variable is called
\1, which is what was used in the substitution pattern. Again, if
we just wrote \1, R would interpret it as if we were trying to escape
the 1. Writing \\1 is interpreted as the character \ followed by 1,
i.e. \1.
4. Central .* Inside the pair in part 3
This is what we are looking for, all characters inside the parentheses.
5. Final .*
This is in the expression to match any characters that may follow the
final set of parentheses.
The sub function will use this to replace the matched pattern (in this case, all characters in the string) with the substitution pattern \1 i.e. the contents of the variable containing whatever was in the first (in our case only) capturing group - the stuff inside the final parentheses.
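The greedy-matching behaviour described in part 1 is not specific to R. A minimal sketch of the same substitution in Python, for illustration only (the function name last_parenthesized is made up); note that a Python raw string needs only single backslashes:

```python
import re


def last_parenthesized(text):
    """Replace the whole string with the contents of its final (...)."""
    # The leading .* is greedy, so it swallows everything up to the
    # last "(" that still lets the rest of the pattern match.
    return re.sub(r".*\((.*)\).*", r"\1", text)


print(last_parenthesized("Positron emission tomography using flutemetamol (18F) "
                         "with computed tomography of brain (procedure)"))
print(last_parenthesized("Abnormal urine odor (finding)"))
```

As with sub in R, a string that contains no parentheses does not match and is returned unchanged.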
You can actually use the following to extract the text inside nested parentheses at the end of string:
x <- c("FELON IN POSSESSION OF AMMUNITION (ACTUAL POSSESSION) (79023)",
"FAIL TO DISPLAY REGISTRATION - POSSESSION REQUIRED (320.0605(1))")
sub(".*(\\(((?:[^()]++|(?1))*)\\))$", "\\2", x, perl=TRUE)
See the online R demo and the regex demo.
Details:
.* - any zero or more chars other than line break chars, as many as possible
(\(((?:[^()]++|(?1))*)\)) - Capturing group 1 (necessary for recursion to take place):
\( - a ( char
((?:[^()]++|(?1))*) - Capturing group 2 (our value): zero or more occurrences of any one or more chars other than ( and ), or the whole Group 1 pattern
\) - a ) char
$ - end of string.
The whole string is thus, when matched, replaced with the value of Group 2. If there is no match, the string remains what it was.
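The (?1) recursion above relies on PCRE. In an engine without recursion, such as Python's built-in re module, the same result can be had with a plain depth counter instead of a regex. The sketch below is that substitute technique, not a translation of the pattern, and the function name last_balanced_group is made up:

```python
def last_balanced_group(text):
    """Return the contents of the balanced (...) group that ends text, or None."""
    if not text.endswith(")"):
        return None
    depth = 0
    # Walk backwards from the final ")" until its matching "(" is found.
    for i in range(len(text) - 1, -1, -1):
        if text[i] == ")":
            depth += 1
        elif text[i] == "(":
            depth -= 1
            if depth == 0:
                return text[i + 1:-1]
    return None  # unbalanced parentheses


x = [
    "FELON IN POSSESSION OF AMMUNITION (ACTUAL POSSESSION) (79023)",
    "FAIL TO DISPLAY REGISTRATION - POSSESSION REQUIRED (320.0605(1))",
]
print([last_balanced_group(s) for s in x])
```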
I want to split this string in several substrings:
BAA33520.2|/gene="vpf402",/product="Vpf402"|GI:8272373|AB012574|join{7347:7965,
0:591}
The separator is | (ascii 124).
It works with all other separators but not with this one.
?regex
Two regular expressions may be joined by the infix operator |; the resulting regular expression matches any string matching either subexpression. For example, abba|cde matches either the string abba or the string cde. Note that alternation does not work inside character classes, where | has its literal meaning.
The fundamental building blocks are the regular expressions that match a single character. Most characters, including all letters and digits, are regular expressions that match themselves. Any metacharacter with special meaning may be quoted by preceding it with a backslash. The metacharacters in extended regular expressions are . \ | ( ) [ { ^ $ * + ?, but note that whether these have a special meaning depends on the context.
Thus:
stringr::str_split('BAA33520.2|/gene="vpf402",/product="Vpf402"|GI:8272373|AB012574|join{7347:7965, 0:591}', "\\|")
As @Frank noted, you can do this in base::strsplit() by adding fixed=TRUE:
strsplit('BAA33520.2|/gene="vpf402",/product="Vpf402"|GI:8272373|AB012574|join{7347:7965, 0:591}',"|", fixed=TRUE)
However, you can also do this with stringr::str_split() by decorating the regular expression for the separator:
stringr::str_split('BAA33520.2|/gene="vpf402",/product="Vpf402"|GI:8272373|AB012574|join{7347:7965, 0:591}',
stringr::regex("|", literal=TRUE))
Incidentally, stringr is pretty much just a slightly friendlier wrapper around stringi functions at this point, and I highly recommend studying the stringi package, as it contains some wonderful gems beyond string splitting.
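The distinction between a literal separator and a regex separator is the same in most languages. A minimal Python sketch, for illustration only, using the string from the question: a plain string split treats | literally (like fixed=TRUE), while a regex split needs the | escaped (like "\\|").

```python
import re

record = ('BAA33520.2|/gene="vpf402",/product="Vpf402"|GI:8272373|'
          'AB012574|join{7347:7965, 0:591}')

# Literal split: "|" is just a character here.
literal_parts = record.split("|")

# Regex split: "|" is the alternation operator, so it must be escaped.
regex_parts = re.split(r"\|", record)

print(literal_parts)
print(literal_parts == regex_parts)
```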
Windows copies path with backslash \, which R does not accept. So, I wanted to write a function which would convert \ to /. For example:
chartr0 <- function(foo) chartr('\','\\/',foo)
Then use chartr0 as...
source(chartr0('E:\RStuff\test.r'))
But chartr0 is not working. I guess, I am unable to escape /. I guess escaping / may be important in many other occasions.
Also, is it possible to avoid calling chartr0 every time and instead convert all paths automatically, for example by creating an environment in R that calls chartr0, or through some temporary setting such as options?
From R 4.0.0 you can use r"(...)" to write a path as raw string constant, which avoids the need for escaping:
r"(E:\RStuff\test.r)"
# [1] "E:\\RStuff\\test.r"
There is a new syntax for specifying raw character constants similar to the one used in C++: r"(...)" with ... any character sequence not containing the sequence )". This makes it easier to write strings that contain backslashes or both single and double quotes. For more details see ?Quotes.
Your fundamental problem is that R will signal an error condition as soon as it sees a single back-slash before any character other than a few lower-case letters, backslashes themselves, quotes or some conventions for entering octal, hex or Unicode sequences. That is because the interpreter sees the back-slash as a message to "escape" the usual translation of characters and do something else. If you want a single back-slash in your character element you need to type 2 backslashes. That will create one backslash:
nchar("\\")
#[1] 1
The "Character vectors" section of _Intro_to_R_ says:
"Character strings are entered using either matching double (") or single (') quotes, but are printed using double quotes (or sometimes without quotes). They use C-style escape sequences, using \ as the escape character, so \\ is entered and printed as \\, and inside double quotes " is entered as \". Other useful escape sequences are \n, newline, \t, tab and \b, backspace—see ?Quotes for a full list."
?Quotes
chartr0 <- function(foo) chartr('\\','/',foo)
chartr0('E:\\RStuff\\test.r')
You cannot write E:\Rxxxx, because R reads the backslash before R as the start of an escape sequence, and \R is not a valid one.
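The escaping rules are the same in most C-style languages, which may help to see the pattern. A minimal sketch in Python, for illustration only: "\\" is one character, a raw string keeps the backslashes as typed (much like R's r"(...)"), and the conversion to forward slashes is a one-character replacement.

```python
from pathlib import PureWindowsPath

# Two typed backslashes produce a single backslash character.
one_backslash = "\\"
print(len(one_backslash))

# A raw string keeps backslashes as typed, so a pasted Windows path works.
windows_path = r"E:\RStuff\test.r"

# Replace each backslash with a forward slash.
converted = windows_path.replace("\\", "/")
print(converted)

# The standard library can do the same conversion.
via_pathlib = PureWindowsPath(windows_path).as_posix()
print(via_pathlib)
```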
The problem is that every single forward slash and backslash in your code is escaped incorrectly, resulting in either an invalid string or the wrong string being used. You need to read up on which characters need to be escaped and how. Take a look at the list of escape sequences in the link below. Anything not listed there (such as the forward slash) is treated literally and does not require any escaping.
http://cran.r-project.org/doc/manuals/R-lang.html#Literal-constants