Where is the actual character set for Ada program text defined? - ada

I'm trying to make a tree-sitter parser, so that IDEs (in this case, Vim) can parse and do more advanced manipulation of Ada program text, such as extract-subprogram and rename-variable. But there seem to be some problems defining the character set.
In the Ada 2012 Reference Manual, I found a list of vague category descriptions, of the form 'Any character whose General Category is X' which means that for instance, besides the underscore, all of these ( ‿ ⁀ ⁔ ︳ ︴ ﹍ ﹎ ﹏ _) are also allowed in an identifier, which seems absurd, and GNAT rejects with 'illegal character'. The list is prefaced by this statement:
"The actual set of graphic symbols used by an implementation for the visual representation of the text of an Ada program is not specified."
Does that really mean there's no way to know which characters should be accepted?
Two pages on, these examples are explicitly given as valid identifiers, and yet GNAT 2021 rejects them:
procedure Main is
Πλάτων : constant := 12; -- Plato
Чайковский : constant := 12; -- Tchaikovsky
θ, φ : constant := 12; -- Angles
begin
null;
end Main;
$ gprbuild
using project file foo.gpr
Compile
[Ada] main.adb
main.adb:2:04: error: declaration expected
main.adb:2:05: error: illegal character
main.adb:3:04: error: declaration expected
main.adb:3:05: error: illegal character
main.adb:4:05: error: illegal character
gprbuild: *** compilation phase failed
Where is the actual character set for Ada programs defined? Has GNAT 2021 got it wrong?
An example program using Unicode characters in identifiers is below for your experimentation. Note that the use of wide characters in the literal string is outside the scope of the question.
main.adb:
with Ada.Wide_Text_IO; use Ada.Wide_Text_IO;
procedure Main is
δεδομένα_πράμα : constant Wide_String := "Ο Πλάτων θα ενέκρινε";
begin
Put_Line (Δεδομένα_πράμα);
end Main;
foo.gpr
project foo is
for Source_Dirs use (".");
for Main use ("main.adb");
package Compiler is
for Default_Switches ("ada") use ("-gnatW8", "-gnatiw");
end Compiler;
end foo;
To build & run:
gprbuild
./main

All Ada versions since Ada 2005 have required that implementations support UTF-8 source code, however for Ada 83 & 95 compatibility don't require it to be the default encoding. GNAT's default source encoding is Latin-1, although it helpfully switches to UTF-8 if a byte-order mark is found. To explicitly specify file encoding, you can pass the -gnatW8 flag, or one of a number of other options.
However, while that allows UTF-8 in source files, identifiers are still limited to Latin-1 in GNAT, you must also pass the -gnatiw flag to allow wide characters in identifiers. It seems that GNAT does not default to it because you can craft very bizarre identifiers (as you noted), but also because identifiers would no longer be properly case-insensitive; GNAT does minimal case folding on any wide character set, other than characters present in other encodings it supports.
ARM § 2.3 specify the requirements for an identifier:
identifier ::= identifier_start {identifier_start | identifier_extend},
where identifier_start can be summarized as anything in Unicode general category L, and the remaining characters can be numbers, punctuation_connectors, decimal marks, and non-whitespace combining marks—with the additional restriction of “An identifier shall not contain two consecutive characters in category punctuation_connector, or end with a character in that category. ”
Going beyond your question, note that despite all these flags, strings are still encoded as Latin-1 (conflictingly, string literals are UTF-8, just not the underlying string :/). You'll need to use Ada.Strings.UTF_Encoding, Wide_Wide_Strings, and/or a library such as VSS for Unicode string handling.

Related

Does R 4.0.0. make it possible to define foo"(...)" operators, similar to the newly introduced r"(...)" syntax?

R 4.0.0 brings in a new syntax for raw strings:
r"(raw string here can contain anything except the closing sequence)"
But this same construct in R 3.x.x produced a syntax error:
Error: unexpected string constant in "r"(asdasd)""
Does it mean that the interpreter was changed in R 4.0.0. ?
And if so - does R 4.0.0. provide a mechanism to define custom functions like foo"()" ?
No, that's not possible at the moment (nor would I anticipate it becoming possible anytime soon).
Here's the NEWS item:
There is a new syntax for specifying raw character constants similar to the one used in C++: r"(...)" with ... any character sequence not containing the sequence )". This makes it easier to write strings that contain backslashes or both single and double quotes. For more details see ?Quotes.
https://cran.r-project.org/doc/manuals/r-devel/NEWS.html
Then from ?Quotes:
Raw character constants are also available using a syntax similar to
the one used in C++: r"(...)" with ... any character
sequence, except that it must not contain the closing sequence
)". The delimiter pairs [] and {} can also be
used, and R can be used in place of r. For additional
flexibility, a number of dashes can be placed between the opening quote
and the opening delimiter, as long as the same number of dashes appear
between the closing delimiter and the closing quote.
https://github.com/wch/r-source/blob/trunk/src/library/base/man/Quotes.Rd
Here's the (git mirror of the SVN patch of the) commit where this functionality was added:
https://github.com/wch/r-source/commit/8b0e58041120ddd56cd3bb0442ebc00a3ab67ebc

Julia: docstrings and LaTeX

Julia has docstrings capabilities, which are documented here https://docs.julialang.org/en/stable/manual/documentation/. I'm under the impression that it has support for LaTeX code, but I'm not sure if the intention is that the LaTeX code should look like code or like an interpretation. In the following, the LaTeX code is garbled somewhat (see rho, for instance) and not interpreted (rho does not look like ρ). Am I doing something wrong?
Is there a way to get LaTeX code look interpreted?
What I mean by interpreted is something like what they do at https://math.stackexchange.com/.
The documentation says that LaTeX code should be wrapped around double back-quotes and that Greek letters should be typed as ρ rather than \rho. But that rather defeats the point of being able to include LaTeX code, doesn't it?
Note: Version 0.5.2 run in Juno/Atom console.
"""
Module blabla
The objective function is:
``\max \mathbb{E}_0 \int_0^{\infty} e^{-\rho t} F(x_t) dt``
"""
module blabla
end
If I then execute the module and query I get this:
With triple quotes, the dollar signs disappear, but the formula is printed on a dark background:
EDIT Follow-up to David P. Sanders' suggestion to use the Documenter.jl package.
using Documenter
doc"""
Module blabla
The objective function is:
$\max \mathbb{E}_0 \int_0^{\infty} e^{-\rho t} F(x_t) dt$
"""
module blabla
end
Gives the following: the LaTeX code appears to print correctly, but it's not interpreted (ρ is displayed as \rho. I followed suggestions in: https://juliadocs.github.io/Documenter.jl/stable/man/latex.html to
Rendering LaTeX code as actual equations has to be supported by whichever software renders your docstrings. So, the short answer to why you're not seeing any rendered equations in Juno is that LaTeX rendering is currently not supported in Juno (as Matt B. pointed out, there's an open issue for that).
The doc"" string literal / string macro is there to get around another issue. Backslashes and dollar signs normally have a special meaning in string literals -- escape sequences and variable interpolation, respectively (e.g. \n gets replaced by a newline character, $(variable) inserts the value of variable into the string). This, of course, clashes with the ordinary LaTeX commands and delimiters (e.g. \frac, $...$). So, to actually have backslashes and dollar signs in a string you need to escape them all with backslashes, e.g.:
julia> "\$\\frac{x}{y}\$" |> print
$\frac{x}{y}$
Not doing that will either give an error:
julia> "$\frac{x}{y}$" |> print
ERROR: syntax: invalid interpolation syntax: "$\"
or invalid characters in the resulting strings:
julia> "``\e^x``" |> print
``^x``
Having to escape everything all the time would, of course, be annoying when writing docstrings. So, to get around this, as David pointed out out, you can use the doc string macro / non-standard string literal. In a doc literal all standard escape sequences are ignored, so an unescaped LaTeX string doesn't cause any issues:
julia> doc"$\frac{x}{y}$" |> print
$$
\frac{x}{y}
$$
Note: doc also parses the Markdown in the string and actually returns a Base.Markdown.MD object, not a string, which is why the printed string is a bit different from the input.
Finally, you can then use these doc-literals as normal docstrings, but you can then freely use LaTeX syntax without having to worry about escaping everything:
doc"""
$\frac{x}{y}$
"""
function foo end
This is also documented in Documenter's manual, although it is not actually specific to Documenter.
Double backticks vs dollar signs. The preferred way to mark LaTeX in Julia docstrings or documentation is by using double backticks or ```math blocks, as documented in the manual. Dollar signs are supported for backwards compatibility.
Note: Documenter's manual and the show methods for Markdown objects in Julia should be updated to reflect this.
You can use
doc"""
...
"""
This is a "non-standard string literal" used by the Documenter.jl package; see https://docs.julialang.org/en/stable/manual/strings/#non-standard-string-literals.

SQLite source code parse.y - nm

I am reading the grammar of SQLite and having a few questions about the following paragraph.
// The name of a column or table can be any of the following:
//
%type nm {Token}
nm(A) ::= id(A).
nm(A) ::= STRING(A).
nm(A) ::= JOIN_KW(A).
The nm has been used quite widely in the program. The lemon parser documentation said
Typically the data type of a non-terminal is a pointer to the root of
a parse-tree structure that contains all information about that
non-terminal
%type expr {Expr*}
Should I understand {Token} actually stands for a syntactic grouping which is a non-terminal token that "is a parse-tree structure that contains all.."?
What is nm short for in this same, is it simply "name"?
What is the period sign (dot .) that each nm(A) declaration ends up with?
No, you should understand that Token is a C object type used for the semantic value of nms.
(It is defined in sqliteInt.h and consists of a pointer to a non-null terminated character array and the length of that array.)
The comment immediately above the definition of nm starts with the words "the name", which definitely suggests to me that nm is an abbreviation for "name", yes. That is also consistent with its semantic type, as above, which is basically a name (or at least a string of characters).
All lemon productions end with a dot. It tells lemon where the end of the production is, like semicolons​ indicate to a C compiler where the end of a statement is. This makes it easier to parse consecutive productions, since otherwise the parser would have to look several symbols ahead to see the ::=

convert comment string to an ASCII character list in sicstus-prolog

currently I am working on comparison between SICStus3 and SICStus4 but I got one issue that is SICStus4 will not consult any cases where the comment string has carriage controls or tab characters etc as given below.
Example case as given below.It has 3 arguments with comma delimiter.
case('pr_ua_sfochi',"
Response:
answer(amount(2370.09,usd),[[01AUG06SFO UA CHI Q9.30 1085.58FUA2SFS UA SFO Q9.30 1085.58FUA2SFS NUC2189.76END ROE1.0 XT USD 180.33 ZPSFOCHI 164.23US6.60ZP5.00AY XF4.50SFO4.5]],amount(2189.76,usd),amount(2189.76,usd),amount(180.33,usd),[[fua2sfs,fua2sfs]],amount(6.6,usd),amount(4.5,usd),amount(0.0,usd),amount(18.6,usd),lasttktdate([20061002]),lastdateafterres(200712282]),[[fic_ticketinfo(fare(fua2sfs),fic([]),nvb([]),nva([]),tktiss([]),penalty([]),tktendorsement([]),tourinfo([]),infomsgs([])),fic_ticketinfo(fare(fua2sfs),fic([]),nvb([]),nva([]),tktiss([]),penalty([]),tktendorsement([]),tourinfo([]),infomsgs([]))]],<>,<>,cat35(cat35info([])))
.
02/20/2006 17:05:10 Transaction 35 served by static.static.server1 (usclsefat002:7551) running E*Fare version $Name: build-2006-02-19-1900 $
",price(pnr(
user('atl','1y',<>,<>,dept(<>,'0005300'),<>,<>,<>),
[
passenger(adt,1,[ptconly(n)])
],
[
segment(1,sfo,chi,'ua','<>','100',20140901,0800,f,20140901,2100,'737',res(20140628,1316),hk,pf2(n,[],[],n),<>,flags(no,no,no,no,no,no,no,no,no)),
segment(2,chi,sfo,'ua','<>','101',20140906,1000,f,20140906,1400,'737',res(20140628,1316),hk,pf2(n,[],[],n),<>,flags(no,no,no,no,no,no,no,no,no))
]),[
rebook(n),
ticket(20140301,131659),
dbaccess(20140301,131659),
platingcarrier('ua'),
tax_exempt([]),
trapparm("trap:ffil"),
city(y)
])).
The below predicate will remove comment section in above case.
flatten-cases :-
getmessage(M1),
write_flattened_case(M1),
flatten-cases.
flatten-cases.
write_flattened_case(M1):-
M1 = case(Case,_Comment,Entry),!,
M2 = case(Case,Entry),
writeq(M2),write('.'),nl.
getmessage(M) :-
read(M),
!,
M \== end_of_file.
:- flatten-cases.
Now my requirement is to convert the comment string to an ASCII character list.
Layout characters other than a regular space cannot occur literally in a quoted atom or a double quoted list. This is a requirement of the ISO standard and is fully implemented in SICStus since 3.9.0 invoking SICStus 3 with the option --iso. Since SICStus 4 only ISO syntax is supported.
You need to insert \n and \t accordingly. So instead of
log('Response:
yes'). % BAD!
Now write
log('Response:\n\tyes').
Or, to make it better readable use a continuation escape sequence:
log('Response:\n\
\tyes').
Note that using literal tabs and literal newlines is highly problematic. On a printout you do not see them! Think of 'A \nB' which would not show the trailing spaces nor trailing tabs.
But there are also many other situations like: Making a screenshot of program text, making a photo of program text, using a 3270 terminal emulator and copying the output. In the past, punched cards. The text-mode when reading files (which was originally motivated by punched cards). Similar arguments hold for the tabulator which comes from typewriters with their manually settable tab stops.
And then on SO it is quite difficult to type in a TAB. The browser refuses to type it (very wisely), and if you copy it in, you get it rendered as spaces.
If I am at it, there is also another problem. The name flatten-case should rather be written flatten_case.

HTML and XML Parsing in Fortran

I am studying mathematical computation and I am completely stuck on this task! I don't even know how to go about starting it!
**Write a program in Fortran that can parse a single line of well-formed HTML or XML markup so that it takes input on a single line (guaranteed to not exceed 80 characters in total) like
-lots of lovely text
where
tag might be anything from 1 to 37 ASCII characters and will not contain spaces
text could contain spaces and be anything from 1 to 73 characters in length
so that the program outputs one of two lines:
tag : text if the two occurrences of tag match inside <...> and
syntax error if anything else is input.
Any help is hugely appreciated !**
There are a number of intrinsic functions for working with strings that may be helpful.
result = index(string, substring) - returns the position of the start of the first occurrence of string substring as a substring in string, counting from one. (Fortran 77)
result = scan(string, set) - scans a string for any of the characters in a set of characters. (Fortran 95)
result = verify(string, set) - verifies that all the characters in a string are present in a set. (Fortran 95)
There are a few user-contributed string tokenization functions on the Fortran Wiki that might be helpful:
delim, strtok, and find_field. Also, FLIBS includes some string manipulation and tokenization routines that might be useful as examples.
Finally, there are a number of existing open-source XML parsers written in Fortran: xmlf90 and xml-fortran. Looking at the source code for these libraries should be helpful.

Resources