Nearley at least one - nearley

I have a grammar where I want to have some whitespace (including newlines) in between two terms. There should be some whitespace, i.e. it should fail if the two terms are touching, however there can be as much whitespace as desired. The issue I'm coming across is that whitespace and newlines are different tokens. I can't work out how to generally make "at least one" in nearley.

I managed to solve this with EBNF modifiers:
ws -> %WS | %NL
# At least one whitespace
someWS -> ws:+
# none or some whitespace
manyWS -> ws:*

If you are using moo.js, than you can predefine whitespace with regular expressions.
#{%
const moo = require('moo')
let lexer = moo.compile({
space: {match: /\s+/, lineBreaks: true}
// other rules
});
%}
#lexer lexer
_ -> null | %space {% d => null %} // any amount of white space or none
__ -> %space {% d => " " %} // at least one white space token

Related

Code for #into: is related to the code for #inject:into: , but must be solving a different problem. Explain the difference?

What does this do, and is there a simpler way to write it?
Collection>>into: a2block
| all pair |
all := ([:allIn :each| allIn key key: allIn value. each])
-> (pair := nil -> a2block).
pair key: all.
self inject: all
into: [:allIn :each| allIn value: (allIn key value: allIn value value: each). allIn].
^all value
If you want to find out, the best way is to try it, possibly step by step thru debugger.
For example, try:
(1 to: 4) into: #+.
(1 to: 4) into: #*.
You'll see a form of reduction.
So it's like inject:into: but without injecting anything.
The a2block will consume first two elements, then result of this and 3rd element, etc...
So we must understand the selector name as inject:into: without inject:: like the implementation, it's already a crooked way to name things.
This is implemented in Squeak as reduce:, except that reduce: would be well too obvious as a selector, and except that it would raise an Error if sent to an empty collection, contrarily to this which will answer the strange recursive artifact.
The all artifact contains a block to be executed for reduction as key, and current value of reduction as value. At first step, it arranges to replace the block to be executed with a2block, and the first value by first element. But the recursive structure used to achieve the replacement is... un-necessarily complex.
A bit less obfuscated way to write the same thing would be:
into: a2block
| block |
self ifEmpty: [self error: 'receiver of into: shall not be empty'].
block := [:newBlock :each| block := newBlock. each].
^self inject: a2block into: [:accum :each| block value: accum value: each]
That's more or less the same principle: at first iteration, the block is replaced by the reduction block (a2block), and first element is accumulated.
The reduction only begins at 2nd iteration.
The alternative is to check for first iteration inside the loop... One way to do it:
into: a2block
| first |
self ifEmpty: [self error: 'receiver of into: shall not be empty'].
first := Object new.
^self inject: first into: [:accum :each |
accum == first
ifTrue: [each]
ifFalse: [a2block value: accum value: each]].
There are many other ways to write it, more explicitely using a boolean to test for first iteration...

How can I delete all keys that don't match certain names with JQ?

I have a huge JSON file with lots of stuff I don't care about, and I want to filter it down to only the few keys I care about, preserving the structure. I won't bother if the same key name might occur in different paths and I get both of them. I gleaned something very close from the answers to this question, it taught me how to delete all properties with certain values, like all null values:
del(..|nulls)
or, more powerfully
del(..|select(. == null))
I searched high and low if I could write a predicate over the name of a property when I am looking at a property. I come from XSLT where I could write something like this:
del(..|select(name|test("^(foo|bar)$")))
where name/1 would be the function that returns the property name or array index number where the current value comes from. But it seems that jq lacks the metadata on its values, so you can only write predicates about their value, and perhaps the type of their value (that's still just a feature of the value), but you cannot inspect the name, or path leading up to it?
I tried to use paths and leaf_paths and stuff like that, but I have no clue what that would do and tested it out to see how this path stuff works, but it seems to find child paths inside an object, not the path leading up to the present value.
So how could this be done, delete everything but a set of key values? I might have found a way here:
walk(
if type == "object" then
with_entries(
select( ( .key |test("^(foo|bar|...)$") )
and ( .value != "" )
and ( .value != null ) )
)
else
.
end
)
OK, this seems to work. But I still wonder it would be so much easier if we had a way of querying the current property name, array index, or path leading up to the present item being inspected with the simple recusion ..| form.
In analogy to your approach using .. and del, you could use paths and delpaths to operate on a stream of path arrays, and delete a given path if not all of its elements meet your conditions.
delpaths([paths | select(all(IN("foo", "bar") or type == "number") | not)])
For the condition I used IN("foo", "bar") but (type == "string" and test("^(foo|bar)$")) would work as well. To also retain array elements (which have numeric indices), I added or type == "number".
Unlike in XML, there's no concept of attributes in jq. You'll need to delete from objects.
To delete an element of an object, you need to use del( obj[ key ] ) (or use with_entries). You can get a stream of the keys of an object using keys[]/keys_unsorted[] and filter out the ones you don't want to delete.
Finally, you need to invert the result of test because you want to delete those that don't match.
After fixing these problems, we get the following:
INDEX( "foo", "bar" ) as $keep |
del(
.. | objects |
.[
keys_unsorted[] |
select( $keep[ . ] | not )
]
)
Demo on jqplay
Note that I substituted the regex match with a dictionary lookup. You could use test( "^(?:foo|bar)\\z" ) in lieu of $keep[ . ], but a dictionary lookup should be faster than a regex match. And it should be less error-prone too, considering you misused $ and (...) in lieu of \z and (?:...).
The above visits deleted branches for nothing. We can avoid that by using walk instead of ...
INDEX( "foo", "bar" ) as $keep |
walk(
if type == "object" then
del(
.[
keys_unsorted[] |
select( $keep[ . ] | not )
]
)
else
.
end
)
Demo on jqplay
Since I mentioned one could use with_entries instead of del, I'll demonstrate.
INDEX( "foo", "bar" ) as $keep |
walk(
if type == "object" then
with_entries( select( $keep[ .key ] ) )
else
.
end
)
Demo on jqplay
Here's a solution that uses a specialized variant of walk for efficiency (*). It retains objects all keys of which are removed; only trivial changes are needed if a blacklist or some other criterion (e.g., regexp-based) is given instead. WHITELIST should be a JSON array of the key names to be retained.
jq --argjson whitelist WHITELIST '
def retainKeys($array):
INDEX($array[]; .) as $keys
| def r:
if type == "object"
then with_entries( select($keys[.key]) )
| map_values( r )
elif type == "array" then map( r )
else .
end;
r;
retainKeys($whitelist)
' input.json
(*) Note for example:
the use of INDEX
the recursive function, r, has arity 0
for objects, the top-level deletion occurs first.
Here's a space-efficient, walk-free approach, tailored for the case of a WHITELIST. It uses the so-called "streaming" parser, so the invocation would look like this:
jq -n --stream --argjson whitelist WHITELIST -f program.jq input.json
where WHITELIST is a JSON array of the names of the keys to be deleted, and
where program.jq is a file containing the program:
# Input: an array
# Output: the longest head of the array that includes only numbers or items in the dictionary
def acceptable($dict):
last(label $out
| foreach .[] as $x ([];
if ($x|type == "number") or $dict[$x] then . + [$x]
else ., break $out
end));
INDEX( $whitelist[]; .) as $dict
| fromstream(inputs
| if length==2
then (.[0] | acceptable($dict)) as $p
| if ($p|length) == (.[0]|length) - 1 then .[0] = $p | .[1] = {}
elif ($p|length) < (.[0]|length) then empty
else .
end
else .
end )
Note: The reason this is relatively complicated is that it assumes that you want to retain objects all of whose keys have been removed, as illustrated in the following example. If that is not the case, then the required jq program is much simpler.
Example:
WHITELIST: '["items", "config", "spec", "setting2", "name"]'
input.json:
{
"items": [
{
"name": "issue1",
"spec": {
"config": {
"setting1": "abc",
"setting2": {
"name": "xyz"
}
},
"files": {
"name": "cde",
"path": "/home"
},
"program": {
"name": "apache"
}
}
},
{
"name": {
"etc": 0
}
}
]
}
Output:
{
"items": [
{
"name": "issue1",
"spec": {
"config": {
"setting2": {
"name": "xyz"
}
}
}
},
{
"name": {}
}
]
}
I am going to put my own tentative answer here.
The thing is, the solution I had already in my question, meaning I can select keys during forward navigation, but I cannot find out the path leading up to the present value.
I looked around in the source code of jq to see how come we cannot inquire the path leading up to the present value, so we could ask for the key string or array index of the present value. And indeed it looks like jq does not track the path while it walks through the input structure.
I think this is actually a huge opportunity forfeited that could be so easily kept track during the tree walk.
This is why I continue thinking that XML with XSLT and XPath is a much more robust data representation and tool chain than JSON. In fact, I find JSON harder to read even than XML. The benefit of the JSON being so close to javascript is really only relevant if - as I do in some cases - I read the JSON as a javascript source code assigning it to a variable, and then instrument it by changing the prototype of the anonymous JSON object so that I have methods to go with them. But changing the prototype is said to cause slowness. Though I don't think it does when setting it for otherwise anonymous JSON objects.
There is JsonPath that tries (by way of the name) to be something like what XPath is for XML. But it is a poor substitute and also has no way to navigate up the parent (or then sibling) axes.
So, in summary, while selecting by key in white or black lists is possible in principle, it is quite hard, because a pretty easy to have feature of a JSON navigation language is not specified and not implemented. Other useful features that could be easily achieved in jq is backward navigation to parent or ancestor of the present value. Currently, if you want to navigate back, you need to capture the ancestor you want to get back to as a variable. It is possible, but jq could be massively improved by keeping track of ancestors and paths.

Count number of occurences of a character in an element using xquery

I have a variable which has | separated values like below.
I need to make sure it never has more than 30 sequences separated by '|', so i believe if i count number of occurrences of '|' in the var it would suffice
class=1111|2222|3333|4444
Can you please help in writing xquery for the same.
I am new to xquery.
If you remove all characters but the bar and then use string-length as in let $s := '1111|2222|3333|4444' return string-length(translate($s, translate($s, '|', ''), '')) you get the number of | characters. That use of string-length and the double translate to remove anything but a certain character is an old XPath 1 trick, of course as XQuery also has replace you could as well use let $s := '1111|2222|3333|4444' return string-length(replace($s, '[^|]+', '')).
You could use the tokenize() function to split the value by the | character, and then count how many items in the sequence with fn:count().
Just remember that the tokenize function uses a regex pattern, so you would need to escape the | as \|:
let $PSV := "1111|2222|3333|4444"
let $tokens := fn:tokenize($PSV, "\|")
let $token-count := fn:count($tokens)
return
if ($token-count > 30) then
fn:error((), "Too many pipe separated values")
else
(: less than thirty values, do stuff with the $tokens :)
()
Just for good measure, and in case you want to do any performance comparisons, you could try
let $sep := string-to-codepoints('|')
return count(string-to-codepoints($in)[.=$sep])
This has the theoretical advantage that (at least in Saxon) it doesn't construct any new strings or sequences in memory.

Find a specific tuple by key in an Erlang list (eJabberd HTTP Header)

I am just getting started with eJabberd and am writing a custom module with HTTP access.
I have the request going through, but am now trying to retrieve a custom header and that's where I'm having problems.
I've used the Request record to get the request_headers list and can see that it contains all of the headers I need (although the one I'm after is a binary string on both the key and value for some reason...) as follows:
[
{ 'Content-Length', <<"100">> },
{ <<"X-Custom-Header">>, <<"CustomValue">> },
{ 'Host', <<"127.0.0.1:5280">> },
{ 'Content-Type', <<"application/json">> },
{ 'User-Agent', <<"Fiddler">> }
]
This is also my first foray into functional programming, so from procedural perspective, I would loop through the list and check if the key is the one that I'm looking for and return the value.
To this end, I've created a function as:
find_header(HeaderKey, Headers) ->
lists:foreach(
fun(H) ->
if
H = {HeaderKey, Value} -> H;
true -> false
end
end,
Headers).
With this I get the error:
illegal guard expression
I'm not even sure I'm going about this the right way so am looking for some advice as to how to handle this sort of scenario in Erlang (and possibly in functional languages in general).
Thanks in advance for any help and advice!
PhilHalf
The List that you have mentioned is called a "Property list", which is an ordinary list containing entries in the form of either tuples, whose first elements are keys used for lookup and insertion or atoms, which work as shorthand for tuples {Atom, true}.
To get a value of key, you may do the following:
proplists:get_value(Key,List).
for Example to get the Content Length:
7> List=[{'Content-Length',<<"100">>},
{<<"X-Custom-Header">>,<<"CustomValue">>},
{'Host',<<"127.0.0.1:5280">>},
{'Content-Type',<<"application/json">>},
{'User-Agent',<<"Fiddler">>}].
7> proplists:get_value('Content-Type',List).
<<"application/json">>
You can use the function lists:keyfind/3:
> {_, Value} = lists:keyfind('Content-Length', 1, Headers).
{'Content-Length',<<"100">>}
> Value.
<<"100">>
The 1 in the second argument tells the function what tuple element to compare. If, for example, you wanted to know what key corresponds to a value you already know, you'd use 2 instead:
> {Key, _} = lists:keyfind(<<"100">>, 2, Headers).
{'Content-Length',<<"100">>}
> Key.
'Content-Length'
As for how to implement this in Erlang, you'd write a recursive function.
Imagine that you're looking at the first element of the list, trying to figure out if this is the entry you're looking for. There are three possibilities:
The list is empty, so there is nothing to compare.
The first entry matches. Return it and ignore the rest of the list.
The first entry doesn't match. Therefore, the result of looking for this key in this list is the same as the result of looking for it in the remaining elements: we recurse.
find_header(_HeaderKey, []) ->
not_found;
find_header(HeaderKey, [{HeaderKey, Value} | _Rest]) ->
{ok, Value};
find_header(HeaderKey, [{_Key, _Value} | Rest]) ->
find_header(HeaderKey, Rest).
Hope this helps.

Simultaneous match in FParsec

If I'm trying to parse the following into lines and fields. Lines are delimited by '\n' and fields are delimited by '|'.
abcd|efgh|ijkl
mnopq\|rst|uvwxy
za|bcd
efg|hijk|lmnop
I can define the following:
let displayCharacter = satisfy (fun c -> ' ' <= c && c <= '~')
let escapedDC = pchar '\\' >>. displayCharacter
let test1 =
run (manyChars (escapedDC <|> displayCharacter)) "asdf\|efgh|ijkl"
// Success: "asdf|efgh|ijkl"
But let fields = sepBy (manyChars (escapedDC <|> displayCharacter)) (pchar '|') cannot work to exclude the '|' from the field. These delimiters are context sensitive, so I want to avoid hard-coding them into displayCharacter since '|' is a display character, but just might need escaping in certain contexts.
If I try to define a single field with manyCharsTill, then I need to account for the final element on a line with anyOf "|\n", but this reads in all of the lines into one line.
I may have further subdelimiters beyond '|' that are supported in certain contexts. For this reason, it seems messy to have to define versions of displayCharacter and escapedDC for every case. Rather, using lookahead features seems cleaner. Or perhaps a parser called both which somehow requires a match on two parsers simultaneously.
manyCharsSepBy (escapedDC <|> displayCharacter) (pchar '|')
or
let contextualDisplayCharacter1 = both displayCharacter (satisfy ((<>) '|'))
Is there an easier way to accomplish this? Perhaps it is just my implied BNF that is flawed - that if fixed, would translate easily?
============
This is the best I can come up with, but I would like to know from the experts if it is the most flexible way.
let displayCharacter (excludeDelimiters : string) = satisfy (fun c -> ' ' <= c && c <= '~' && not (Seq.exists ((=) c) excludeDelimiters))
let escapedDisplayCharacter = pchar '\\' >>. displayCharacter ""
let field =
manyChars (escapedDisplayCharacter <|> displayCharacter "|")

Resources