How to find double quotation marks in ML

How to find double quotation marks in ML - functional-programming

I have this code which finds double quotation marks and converts the inside of those quotation marks into a string. It manages to find the first quotation mark but fails to find the second so: "this" would be "this . How do I get it I can get this function to find the full string.

Maybe this is too obvious:
if (ch = #"\"") then SOME(String(x ^ "\""))

I do not really understand your code: you return the string just after the first occurence of the quotation mark, but this string has been built with the characters that you've found before it. Moreover, why do you return SOME(Error) instead of NONE?
You need to use a boolean variable to know when the first quotation mark has been seen and to stop when the second one is found. So I would write something like this:
fun parseString x inStr quote =
case (TextIO.input1 inStr, quote) of
(NONE, _) => NONE
| (SOME #"\"", true) => SOME x
| (SOME #"\"", false) => parseString x inStr true
| (SOME ch, true) => parseString (x ^ (String.str ch)) inStr quote
| (SOME _ , false) => parseString x inStr quote;
and initialize quote with false.

Related

Why does my Prolog S-expression tokenizer fail on its base case?

To learn some Prolog (I'm using GNU Prolog) and grok its parsing abilities, I am starting by writing a Lisp (or S-expression, if I'm being exact) tokenizer, which given a set of tokens like ['(', 'f', 'o', 'o', ')'] should produce ['(', 'foo', ')']. It's not working as expected, which is why I'm here! I thought my thought process shined through in my pseudocode:
tokenize([current | rest], buffer, tokens):
if current is '(' or ')',
Tokenize the rest,
And the output will be the current token buffer,
Plus the parenthesis and the rest.
if current is ' ',
Tokenize the rest with a clean buffer,
And the output will be the buffer plus the rest.
if the tail is empty,
The output will be a one-element list containing the buffer.
otherwise,
Add the current character to the buffer,
And the output will be the rest tokenized, with a bigger buffer.
I translated that to Prolog like this:
tokenize([Char | Chars], Buffer, Tokens) :-
((Char = '(' ; Char = ')') ->
tokenize(Chars, '', Tail_Tokens),
Tokens is [Buffer, Char | Tail_Tokens];
Char = ' ' ->
tokenize(Chars, '', Tail_Tokens),
Tokens is [Buffer | Tail_Tokens];
Chars = [] -> Tokens is [Buffer];
atom_concat(Buffer, Char, New_Buffer),
tokenize(Chars, New_Buffer, Tokens)).
print_tokens([]) :- write('.').
print_tokens([T | N]) :- write(T), write(', '), print_tokens(N).
main :-
% tokenize(['(', 'f', 'o', 'o', '(', 'b', 'a', 'r', ')', 'b', 'a', 'z', ')'], '', Tokens),
tokenize(['(', 'f', 'o', 'o', ')'], '', Tokens),
print_tokens(Tokens).
When running the result, below, like this: gprolog --consult-file lisp_parser.pl it just tells me no. I traced main, and it gave me the stack trace below. I do not understand why tokenize fails for the empty case. I see that the buffer is empty since it was cleared with the previous ')', but even if Tokens is empty at that point in time, wouldn't Tokens accumulate a larger result recursively? Can someone who is good with Prolog give me a few tips here?
| ?- main.
no
| ?- trace.
The debugger will first creep -- showing everything (trace)
(1 ms) yes
{trace}
| ?- main.
1 1 Call: main ?
2 2 Call: tokenize(['(',f,o,o,')'],'',_353) ?
3 3 Call: tokenize([f,o,o,')'],'',_378) ?
4 4 Call: atom_concat('',f,_403) ?
4 4 Exit: atom_concat('',f,f) ?
5 4 Call: tokenize([o,o,')'],f,_429) ?
6 5 Call: atom_concat(f,o,_454) ?
6 5 Exit: atom_concat(f,o,fo) ?
7 5 Call: tokenize([o,')'],fo,_480) ?
8 6 Call: atom_concat(fo,o,_505) ?
8 6 Exit: atom_concat(fo,o,foo) ?
9 6 Call: tokenize([')'],foo,_531) ?
10 7 Call: tokenize([],'',_556) ?
10 7 Fail: tokenize([],'',_544) ?
9 6 Fail: tokenize([')'],foo,_519) ?
7 5 Fail: tokenize([o,')'],fo,_468) ?
5 4 Fail: tokenize([o,o,')'],f,_417) ?
3 3 Fail: tokenize([f,o,o,')'],'',_366) ?
2 2 Fail: tokenize(['(',f,o,o,')'],'',_341) ?
1 1 Fail: main ?
(1 ms) no
{trace}
| ?-

How about this. I think that's what you want to do, but let's use Definite Clause Grammars (which are just horn clauses with :- replaced by --> and two elided arguments holding the input character list and remaining character list. An example DCG rule:
rule(X) --> [c], another_rule(X), {predicate(X)}.
List processing rule rule//1 says: When you find character c in the input list, then continue list processing with another_rule//1, and when that worked out, call predicate(X) as normal.
Then:
% If we encounter a separator symbol '(' or ')', we commit to the
% clause using '!' (no point trying anything else, in particular
% not the clause for "other characters", tokenize the rest of the list,
% and when we have done that decide whether 'MaybeToken', which is
% "part of the leftmost token after '(' or ')'", should be retained.
% it is dropped if it is empty. The caller is then given an empty
% "part of the leftmost token" and the list of tokens, with '(' or ')'
% prepended: "tokenize('', [ '(' | MoreTokens] ) -->"
tokenize('', [ '(' | MoreTokens] ) -->
['('],
!,
tokenize(MaybeToken,Tokens),
{drop_empty(MaybeToken,Tokens,MoreTokens)}.
tokenize('',[')'|MoreTokens]) -->
[')'],
!,
tokenize(MaybeToken,Tokens),
{drop_empty(MaybeToken,Tokens,MoreTokens)}.
% No more characters in the input list (that's what '--> []' says).
% We succeed, with an empty token list and an empty buffer fro the
% leftmost token.
tokenize('',[]) --> [].
% If we find a 'Ch' that is not '(' or ')', then tokenize
% more of the list via 'tokenize(MaybeToken,Tokens)'. On
% returns 'MaybeToken' is a piece of the leftmost token found
% in that list, so we have to stick 'Ch' onto its start.
tokenize(LargerMaybeToken,Tokens) -->
[Ch],
tokenize(MaybeToken,Tokens),
{atom_concat(Ch,MaybeToken,LargerMaybeToken)}.
% ---
% This drops an empty "MaybeToken". If "MaybeToken" is
% *not* empty, it is actually a token and prepended to the list "Tokens"
% ---
drop_empty('',Tokens,Tokens) :- !.
drop_empty(MaybeToken,Tokens,[MaybeToken|Tokens]).
% -----------------
% Call the DCG using phrase/2
% -----------------
tokenize(Text,Result) :-
phrase( tokenize(MaybeToken,Tokens), Text ),
drop_empty(MaybeToken,Tokens,Result),!.
And so:
?- tokenize([h,e,l,l,o],R).
R = [hello].
?- tokenize([h,e,l,'(',l,')',o],R).
R = [hel,(,l,),o].
?- tokenize([h,e,l,'(',l,l,')',o],R).
R = [hel,(,ll,),o].
I think in GNU Prolog, the notation `hello` generates [h,e,l,l,o] directly.

I do not understand why tokenize fails for the empty case.
The reason anything fails in Prolog is because there is no clause that makes it true. If your only clause for tokenize is of the form tokenize([Char | Chars], ...), then no call of the form tokenize([], ...) will ever be able to match this clause, and since there are no other clauses, the call will fail.
So you need to add such a clause. But first:
:- set_prolog_flag(double_quotes, chars).
This allows you to write ['(', f, o, o, ')'] as "foo".
Also, you must plan for the case where the input is completely empty, or other cases where you must maybe emit a token for the buffer, but only if it is not '' (since there should be no '' tokens littering the result).
finish_buffer(Tokens, Buffer, TokensMaybeWithBuffer) :-
( Buffer = ''
-> TokensMaybeWithBuffer = Tokens
; TokensMaybeWithBuffer = [Buffer | Tokens] ).
For example:
?- finish_buffer(MyTokens, '', TokensMaybeWithBuffer).
MyTokens = TokensMaybeWithBuffer.
?- finish_buffer(MyTokens, 'foo', TokensMaybeWithBuffer).
TokensMaybeWithBuffer = [foo|MyTokens].
Note that you can prepend the buffer to the list of tokens, even if you don't yet know what that list of tokens is! This is the power of logical variables. The rest of the code uses this technique as well.
So, the case for the empty input:
tokenize([], Buffer, Tokens) :-
finish_buffer([], Buffer, Tokens).
For example:
?- tokenize([], '', Tokens).
Tokens = [].
?- tokenize([], 'foo', Tokens).
Tokens = [foo].
And the remaining cases:
tokenize([Parenthesis | Chars], Buffer, TokensWithParenthesis) :-
( Parenthesis = '('
; Parenthesis = ')' ),
finish_buffer([Parenthesis | Tokens], Buffer, TokensWithParenthesis),
tokenize(Chars, '', Tokens).
tokenize([' ' | Chars], Buffer, TokensWithBuffer) :-
finish_buffer(Tokens, Buffer, TokensWithBuffer),
tokenize(Chars, '', Tokens).
tokenize([Char | Chars], Buffer, Tokens) :-
Char \= '(',
Char \= ')',
Char \= ' ',
atom_concat(Buffer, Char, NewBuffer),
tokenize(Chars, NewBuffer, Tokens).
Note how I used separate clauses for the separate cases. This makes the code more readable, but it does have the drawback compared to (... -> ... ; ...) that the last clause must exclude characters handled by previous clauses. Once you have your code in this shape, and you're happy that it works, you can transform it into a form using (... -> ... ; ...) if you really want to.
Examples:
?- tokenize("(foo)", '', Tokens).
Tokens = ['(', foo, ')'] ;
false.
?- tokenize(" (foo)", '', Tokens).
Tokens = ['(', foo, ')'] ;
false.
?- tokenize("(foo(bar)baz)", '', Tokens).
Tokens = ['(', foo, '(', bar, ')', baz, ')'] ;
false.
Finally, and very importantly, the is operator is meant only for evaluation of arithmetic expressions. It will throw an exception when you apply it to anything that is not arithmetic. Unification is different from the evaluation of arithmetic expression. Unification is written as =.
?- X is 2 + 2.
X = 4.
?- X = 2 + 2.
X = 2+2.
?- X is [a, b, c].
ERROR: Arithmetic: `[a,b,c]' is not a function
ERROR: In:
ERROR: [20] throw(error(type_error(evaluable,...),_3362))
ERROR: [17] arithmetic:expand_function([a,b|...],_3400,_3402) at /usr/lib/swi-prolog/library/arithmetic.pl:175
ERROR: [16] arithmetic:math_goal_expansion(_3450 is [a|...],_3446) at /usr/lib/swi-prolog/library/arithmetic.pl:147
ERROR: [14] '$expand':call_goal_expansion([system- ...],_3512 is [a|...],_3492,_3494,_3496) at /usr/lib/swi-prolog/boot/expand.pl:863
ERROR: [13] '$expand':expand_goal(_3566 is [a|...],_3552,_3554,_3556,user,[system- ...],_3562) at /usr/lib/swi-prolog/boot/expand.pl:524
ERROR: [12] setup_call_catcher_cleanup('$expand':'$set_source_module'(user,user),'$expand':expand_goal(...,_3640,_3642,_3644,user,...,_3650),_3614,'$expand':'$set_source_module'(user)) at /usr/lib/swi-prolog/boot/init.pl:443
ERROR: [8] '$expand':expand_goal(user:(_3706 is ...),_3692,user:_3714,_3696) at /usr/lib/swi-prolog/boot/expand.pl:458
ERROR: [6] setup_call_catcher_cleanup('$toplevel':'$set_source_module'(user,user),'$toplevel':expand_goal(...,...),_3742,'$toplevel':'$set_source_module'(user)) at /usr/lib/swi-prolog/boot/init.pl:443
ERROR:
ERROR: Note: some frames are missing due to last-call optimization.
ERROR: Re-run your program in debug mode (:- debug.) to get more detail.
^ Call: (14) call('$expand':'$set_source_module'(user)) ? abort
% Execution Aborted
?- X = [a, b, c].
X = [a, b, c].

Is there a way to display this only once?

I wrote this sml function that allows me to display the first 5 columns of the Ascii table.
fun radix (n, base) =
let
val b = size base
val digit = fn n => str (String.sub (base, n))
val radix' =
fn (true, n) => digit n
| (false, n) => radix (n div b, base) ^ digit (n mod b)
in
radix' (n < b, n)
end;
val n = 255;
val charList = List.tabulate(n+1,
fn x => print(
"DEC"^"\t"^"OCT"^"\t"^"HEX"^"\t"^"BIN"^"\t"^"Symbol"^"\n"^
Int.toString(x)^"\t"^
radix (x, "01234567")^"\t"^
radix (x, "0123456789abcdef")^"\t"^
radix (x, "01")^"\t"^
Char.toCString(chr(x))^"\t"
)
);
But I want the header : "DEC"^"\t"^"OCT"^"\t"^"HEX"^"\t"^"BIN"^"\t"^"Symbol" to be displayed only once at the beginning, but I can't do it. Does anyone know a way to do it?
On the other hand I would like to do without the resursive call of the "radix" function. Is that possible? And is it a wise way to write this function?

I want the header : "DEC"... to be displayed only once at the beginning
Currently the header displays multiple times because it is being printed inside of List.tabulate's function, once for each number in the table. So you can move printing the header outside of this function and into a parent function.
For clarity I might also move the printing of an individual character into a separate function. (I think you have indented the code in your charList very nicely, but if a function does more than one thing, it is doing too many things.)
E.g.
fun printChar (i : int) =
print (Int.toString i ^ ...)
fun printTable () =
( print "DEC\tOCT\tHEX\tBIN\tSymbol\n"
; List.tabulate (256, printChar)
; () (* why this? *)
)
It is very cool that you found Char.toCString which is safe compared to simply printing any character. It seems to give some pretty good names for e.g. \t and \n, but hardly for every function. So if you really want to spice up your table, you could add a helper function,
fun prettyName character =
if Char.isPrint character
then ...
else case ord character of
0 => "NUL (null)"
| 1 => "SOH (start of heading)"
| 2 => "STX (start of text)"
| ...
and use that instead of Char.toCString.
Whether to print a character itself or some description of it might be up to Char.isPrint.
I would like to do without the resursive call of the "radix" function.
Is that possible?
And is it a wise way to write this function?
You would need something equivalent to your radix function either way.
Sure, it seems okay. You could shorten it a bit, but the general approach is good.
You have avoided list recursion by doing String.sub constant lookups. That's great.

How do you access name of a ProtoField after declaration?

How can I access the name property of a ProtoField after I declare it?
For example, something along the lines of:
myproto = Proto("myproto", "My Proto")
myproto.fields.foo = ProtoField.int8("myproto.foo", "Foo", base.DEC)
print(myproto.fields.foo.name)
Where I get the output:
Foo

An alternate method that's a bit more terse:
local fieldString = tostring(field)
local i, j = string.find(fieldString, ": .* myproto")
print(string.sub(fieldString, i + 2, j - (1 + string.len("myproto")))
EDIT: Or an even simpler solution that works for any protocol:
local fieldString = tostring(field)
local i, j = string.find(fieldString, ": .* ")
print(string.sub(fieldString, i + 2, j - 1))
Of course the 2nd method only works as long as there are no spaces in the field name. Since that's not necessarily always going to be the case, the 1st method is more robust. Here is the 1st method wrapped up in a function that ought to be usable by any dissector:
-- The field is the field whose name you want to print.
-- The proto is the name of the relevant protocol
function printFieldName(field, protoStr)
local fieldString = tostring(field)
local i, j = string.find(fieldString, ": .* " .. protoStr)
print(string.sub(fieldString, i + 2, j - (1 + string.len(protoStr)))
end
... and here it is in use:
printFieldName(myproto.fields.foo, "myproto")
printFieldName(someproto.fields.bar, "someproto")

Ok, this is janky, and certainly not the 'right' way to do it, but it seems to work.
I discovered this after looking at the output of
print(tostring(myproto.fields.foo))
This seems to spit out the value of each of the members of ProtoField, but I couldn't figure out the correct way to access them. So, instead, I decided to parse the string. This function will return 'Foo', but could be adapted to return the other fields as well.
function getname(field)
--First, convert the field into a string
--this is going to result in a long string with
--a bunch of info we dont need
local fieldString= tostring(field)
-- fieldString looks like:
-- ProtoField(188403): Foo myproto.foo base.DEC 0000000000000000 00000000 (null)
--Split the string on '.' characters
a,b=fieldString:match"([^.]*).(.*)"
--Split the first half of the previous result (a) on ':' characters
a,b=a:match"([^.]*):(.*)"
--At this point, b will equal " Foo myproto"
--and we want to strip out that abreviation "abvr" part
--Count the number of times spaces occur in the string
local spaceCount = select(2, string.gsub(b, " ", ""))
--Declare a counter
local counter = 0
--Declare the name we are going to return
local constructedName = ''
--Step though each word in (b) separated by spaces
for word in b:gmatch("%w+") do
--If we hav reached the last space, go ahead and return
if counter == spaceCount-1 then
return constructedName
end
--Add the current word to our name
constructedName = constructedName .. word .. " "
--Increment counter
counter = counter+1
end
end

Julia metaprogramming return symbol

I'm trying to figure out how to have a quote block, when evaluated, return a symbol. See the example below.
function func(symbol::Symbol)
quote
z = $symbol
symbol
end
end
a = 1
eval(func(:a)) #this returns :symbol. I would like it to return :a
z

The symbol your function returned where the symbol function, due to the last symbol in your qoute did not have $ in front. The second problem is you would like to return the symbol it self, which requires you make a quote inside the quote similar to this question
Julia: How do I create a macro that returns its argument?
function func(s::Symbol)
quote
z = $s
$(Expr(:quote, s)) # This creates an expresion inside the quote
end
end
a = 1
eval(func(:a)) #this returns :a
z

How should I understand `let _ as s = "abc" in s ^ "def"` in OCaml?

let _ as s = "abc" in s ^ "def"
So how should understand this?
I guess it is some kind of let pattern = expression thing?
First, what's the meaning/purpose/logic of let pattern = expression?
Also, in pattern matching, I know there is pattern as identifier usage, in let _ as s = "abc" in s ^ "def", _ is pattern, but behind as, it is an expression s = "abc" in s ^ "def", not an identifier, right?
edit:
finally, how about this: (fun (1 | 2) as i -> i + 1) 2, is this correct?
I know it is wrong, but why? fun pattern -> expression is allowed, right?
I really got lost here.

The grouping is let (_ as s) = "abc" -- which is just a convoluted way of saying let s = "abc", because as with a wildcard pattern _ in front is pretty much useless.

The expression let pattern = expr1 in expr2 is pretty central to OCaml. If the pattern is just a name, it lets you name an expression. This is like a local variable in other language. If the pattern is more complicated, it lets you destructure expr1, i.e., it lets you give names to its components.
In your expression, behind as is just an identifier: s. I suspect your confusion all comes down to this one thing. The expression can be parenthesized as:
let (_ as s) = "abc" in s ^ "def"
as Andreas Rossberg shows.
Your final example is correct if you add some parentheses. The compiler/toplevel rightly complains that your function is partial; i.e., it doesn't know what to do with most ints,
only with 1 and 2.
Edit: here's a session that shows how to add the parentheses to your final example:
$ ocaml
OCaml version 4.00.0
# (fun (1 | 2) as i -> i + 1) 2;;
Error: Syntax error
# (fun ((1 | 2) as i) -> i + 1) 2;;
Warning 8: this pattern-matching is not exhaustive.
Here is an example of a value that is not matched:
0
- : int = 3
#
Edit 2: here's a session that shows how to remove the warning by specifying an exhaustive set of patterns.
$ ocaml
OCaml version 4.00.0
# (function ((1|2) as i) -> i + 1 | _ -> -1) 2;;
- : int = 3
# (function ((1|2) as i) -> i + 1 | _ -> -1) 3;;
- : int = -1
#