Pyparsing: grouping results by keywords ending with colons

I am new to pyparsing. Although I did try to read through the docs, I did not manage to solve my first problem: grouping expressions by "keyword: token(s)". I ended up with this code:
import pyparsing as pp
from pprint import pprint
token = pp.Word(pp.alphas)
keyword = pp.Combine(token + pp.Literal(":"))
expr = pp.Group(keyword[1] + token[1, ...])
pprint(expr.parse_string("keyA: aaa bb ccc keyB: ddd eee keyC: xxx yyy hhh zzz").as_list())
It stops in the middle and parses the second keyword as a regular token: the expression
keyA: aaa bb ccc keyB: ddd eee keyC: xxx yyy hhh zzz
gets parsed into:
[['keyA:', 'aaa', 'bb', 'ccc', 'keyB']]
I cannot figure out how to define keyword and token.
Edit.
In general, I'd like to parse the following expression:
keyword1: token11 token12 ... keyword2: token21 & token22 token23 keyword3: (token31 token32) & token33
into the following list:
[
    ["keyword1:", "token11", "token12", ...],
    ["keyword2:", ["token21", "&", "token22"], "token23"],
    ["keyword3:", [["token31", "token32"], "&", "token33"]],
]

OK, so I was looking for a way to require that a token end at a word boundary, i.e. that it not run directly into a ":". It turns out pyparsing has a class for this, WordEnd(): its default word characters are the printables, so a word immediately followed by ":" no longer matches as a plain token. With it, the expression is parsed correctly.
import pyparsing as pp
from pprint import pprint
token = pp.Combine(pp.Word(pp.alphas) + pp.WordEnd())
keyword = pp.Combine(pp.Word(pp.alphas) + pp.Literal(":"))
expr = pp.Group(keyword[1] + token[1, ...])[1, ...]
pprint(expr.parse_string("keyA: aaa bb ccc keyB: ddd eee keyC: xxx yyy hhh zzz").as_list())
[['keyA:', 'aaa', 'bb', 'ccc'],
 ['keyB:', 'ddd', 'eee'],
 ['keyC:', 'xxx', 'yyy', 'hhh', 'zzz']]

To add support for the '&' operator as in your original post, you were very close with your use of infixNotation. In your original, you had an expression like "a b & c", which you wanted to parse as ['a', ['b', '&', 'c']]. The first issue, telling tokens apart from keys, you have already resolved for yourself. The second has to do with your operators: infixNotation lets you define an empty operator by passing a Python None as the operator expression. Since plain adjacency should bind more loosely than '&', you would define your expression as:
expr = infixNotation(token,
    [
        ('&', 2, opAssoc.LEFT,),
        (None, 2, opAssoc.LEFT,),
    ])
Use runTests to quickly run a bunch of tests:
expr.runTests("""\
a b c
a & b
a & b & c
a & b c
a & (b c)
""", fullDump=False)
Prints:
a b c
[['a', 'b', 'c']]
a & b
[['a', '&', 'b']]
a & b & c
[['a', '&', 'b', '&', 'c']]
a & b c
[[['a', '&', 'b'], 'c']]
a & (b c)
[['a', '&', ['b', 'c']]]
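Putting the pieces together, here is a combined sketch (my own assembly, not code from the thread). Two caveats: a token is defined here with a negative lookahead for ':' rather than WordEnd, since WordEnd's printables-based boundary would also reject a token standing directly before a closing parenthesis; and runs of adjacent tokens come back wrapped in one extra group compared with the target layout in the edit.
import pyparsing as pp
from pprint import pprint

# A token is a word not immediately followed by ':' (that would be a keyword).
token = pp.Combine(pp.Word(pp.alphanums) + ~pp.Literal(":"))
keyword = pp.Combine(pp.Word(pp.alphanums) + pp.Literal(":"))

# '&' binds tighter than plain adjacency (the empty None operator).
value = pp.infixNotation(token,
    [
        ('&', 2, pp.opAssoc.LEFT,),
        (None, 2, pp.opAssoc.LEFT,),
    ])

expr = pp.Group(keyword + value)[1, ...]
pprint(expr.parse_string(
    "keyA: aaa bb ccc keyB: ddd & eee fff keyC: (xxx yyy) & zzz",
    parse_all=True).as_list())
# -> [['keyA:', ['aaa', 'bb', 'ccc']],
#     ['keyB:', [['ddd', '&', 'eee'], 'fff']],
#     ['keyC:', [['xxx', 'yyy'], '&', 'zzz']]]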

Related

Regular expression in Teradata

I need to search for a few patterns in a column using a regular expression in Teradata. One example is shown below:
SELECT
    REGEXP_SUBSTR(
        REGEXP_SUBSTR('1-2-3', '([0-9] *- *[0-9] *- *[0-9])', 1, 1, 'i'),
        '([0-9] *- *[0-9] *- *[0-9])',
        1, 1, 'i'
    ) AS tmp,
    REGEXP_SUBSTR(tmp, '(^[0-9])', 1, 1, 'i')
        || '-' ||
    REGEXP_SUBSTR(tmp, '([0-9]$)', 1, 1, 'i') AS final_exp
;
In the above expression, I am extracting "1-3" out of a pattern like "1-2-3". Now the patterns can be anything like: 1-2-3-4-5 or 1-2,3 or 1&2-3 or 1-2,3 &4.
Is there any way to generalize the search pattern in the regular expression? Something like [-,&]* will only search for occurrences of these characters in a fixed order, but in the data the characters can appear in any order.
A few examples are given below; the need is to fetch all of the desired results using a single search pattern in the expression.
Column value ==> Result
abc 1-2+3- 4 ==> 1-4
def 10,12 & 13 ==> 10-13
ijk 1,2,3, and 4 lmn ==> 1-4
abc1-2 & 3 def ==> 1-3
ikl 11 &12 -13 ==> 11-13
oAy$ 7-8 and 9 ==> 7-9
RegExp_Substr(col, '(\d+)', 1, 1, 'c') || '-' ||
RegExp_Substr(col, '(\d+)(?!.*\d)', 1, 1, 'c')
(\d+) = the first number
(\d+)(?!.*\d) = the last number (a run of digits not followed by any further digit)
There's also no need for those optional parameters, because it's using the defaults anyway:
RegExp_Substr(col, '(\d+)') || '-' ||
RegExp_Substr(col, '(\d+)(?!.*\d)')
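Both patterns are ordinary regex features, so you can sanity-check them outside Teradata. A quick Python sketch (illustration only, not Teradata syntax) run against the sample column values:
import re

samples = [
    "abc 1-2+3- 4", "def 10,12 & 13", "ijk 1,2,3, and 4 lmn",
    "abc1-2 & 3 def", "ikl 11 &12 -13", "oAy$ 7-8 and 9",
]
for s in samples:
    first = re.search(r"\d+", s).group()         # first run of digits
    last = re.search(r"\d+(?!.*\d)", s).group()  # last run of digits
    print(s, "==>", first + "-" + last)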

Using jq how do I search against an array of strings? [duplicate]

Since an example is worth a thousand words, say I have the following JSON stream:
{"a": 0, "b": 1}
{"a": 2, "b": 2}
{"a": 7, "b": null}
{"a": 3, "b": 7}
How can I keep all the objects for which the .b property is one of [1, 7]? (In reality the list is much longer, so I don't want to write select(.b == 1 or .b == 7).) I'm looking for something like select(.b in [1, 7]), but I couldn't find what I'm looking for in the man page.
Doing $value in $collection could be achieved using the pattern select($value == $collection[]). A more efficient alternative would be select(any($value == $collection[]; .)). So your filter should be this:
[1, 7] as $whitelist | select(any(.b == $whitelist[]; .))
Having the array in a variable has its benefits as it lets you change the whitelist easily using arguments.
$ jq --argjson whitelist '[2, 7]' 'select(any(.b == $whitelist[]; .))'
The following approach using index/1 is similar to what was originally sought (".b in [1, 7]"), and might be noticeably faster than using .[] within select if the whitelist is large.
If your jq supports --argjson:
jq --argjson w '[1,7]' '. as $in | select($w | index($in.b))'
Otherwise:
jq --arg w '[1,7]' '. as $in | ($w|fromjson) as $w | select($w | index($in.b))'
or:
jq '. as $in | select([1, 7] | index($in.b))'
UPDATE
On Jan 30, 2017, a builtin named IN was added for efficiently testing whether a JSON entity is contained in a stream. It can also be used for efficiently testing membership in an array. For example, the above invocation with --argjson can be simplified to:
jq --argjson w '[1,7]' 'select( .b | IN($w[]) )'
If your jq does not have IN/1, then so long as your jq has first/1, you can use this equivalent definition:
def IN(s): . as $in | first(if (s == $in) then true else empty end) // false;
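As an aside, if the stream ever needs to be processed outside jq, the same whitelist test is only a few lines of Python over newline-delimited JSON (a sketch for comparison only):
import json
import sys

whitelist = {1, 7}  # a set gives O(1) membership tests
for line in sys.stdin:
    obj = json.loads(line)
    if obj.get("b") in whitelist:
        print(json.dumps(obj))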

How to get the index path of found values using jq?

Say I have a JSON like this:
{
  "json": [
    "a",
    [
      "b",
      "c",
      [
        "d",
        "foo",
        1
      ],
      [
        [
          42,
          "foo"
        ]
      ]
    ]
  ]
}
And I want an array of jq index paths that contain foo:
[
  ".json[1][2][1]",
  ".json[1][3][0][1]"
]
Can I achieve this using jq and how?
I tried recurse | .foo to get the matches first but I receive an error: Cannot index array with string "foo".
First of all, I'm not sure what the purpose of obtaining an array of jq programs would be. While means of doing this exist, they are seldom necessary; jq does not provide any sort of eval command.
jq has the concept of a path: an array of strings and numbers representing the position of an element in a JSON document, which is equivalent to the strings in your expected output. As an example, ".json[1][2][1]" would be represented as ["json", 1, 2, 1]. The standard library contains several functions that operate on this concept, such as getpath, setpath, paths and leaf_paths.
We can thus obtain all leaf paths in the given JSON and iterate through them, select those for which their value in the input JSON is "foo", and generate an array out of them:
jq '[paths as $path | select(getpath($path) == "foo") | $path]'
This will return, for your given input, the following output:
[
  ["json", 1, 2, 1],
  ["json", 1, 3, 0, 1]
]
Now, although it should not be necessary, and it is most likely a sign that you're approaching whatever problem you are facing in the wrong way, it is possible to convert these arrays to the jq path strings you seek by transforming each path through the following script:
".\(map("[\(tojson)]") | join(""))"
The full script would therefore be:
jq '[paths as $path | select(getpath($path) == "foo") | $path | ".\(map("[\(tojson)]") | join(""))"]'
And its output would be:
[
  ".[\"json\"][1][2][1]",
  ".[\"json\"][1][3][0][1]"
]
Santiago's excellent program can be further tweaked to produce output in the requested format:
def jqpath:
  def t: test("^[A-Za-z_][A-Za-z0-9_]*$");
  reduce .[] as $x ("";
    if ($x|type) == "string"
    then . + ($x | if t then ".\(.)" else ".[" + tojson + "]" end)
    else . + "[\($x)]"
    end);
[paths as $path | select( getpath($path) == "foo" ) | $path | jqpath]
Saved together with the jqpath definition in wrangle.jq, this can be run as:
jq -f wrangle.jq input.json
[
  ".json[1][2][1]",
  ".json[1][3][0][1]"
]
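For comparison, the same paths-plus-getpath walk can be written as an ordinary recursive function; here is an illustrative Python sketch (not part of the original answers) that produces the same strings:
import json
import re

IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")

def jq_paths(target, node, prefix=""):
    # Walk dicts and lists recursively, yielding a jq-style path
    # string for every leaf value equal to `target`.
    if isinstance(node, dict):
        for key, child in node.items():
            step = "." + key if IDENTIFIER.match(key) else "[" + json.dumps(key) + "]"
            yield from jq_paths(target, child, prefix + step)
    elif isinstance(node, list):
        for i, child in enumerate(node):
            yield from jq_paths(target, child, prefix + "[" + str(i) + "]")
    elif node == target:
        yield prefix or "."

doc = {"json": ["a", ["b", "c", ["d", "foo", 1], [[42, "foo"]]]]}
print(list(jq_paths("foo", doc)))
# ['.json[1][2][1]', '.json[1][3][0][1]']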

In Julia, how do I assign the output of an expression to a new variable?

Stupid example: I would like to do something like
X = println("hi")
and get
X = "hi".
The general solution is to use IOBuffer and takebuf_string as described by @ARM above. If it's enough to capture the output of print, then
s = string(args...)
gives the string that would have been printed by print(args...). Also,
s = repr(X)
gives the string that would have been printed by showall(X). Both are implemented using IOBuffer and takebuf_string internally.
I think the poster wants to access the nice summary format that you can get from println. One way to access that as a string is to write to a buffer using print and then read it back as a string. There's probably also an easier way.
using DataFrames
data = DataFrame()
data[:turtle] = ["Suzy", "Suzy", "Bob", "Batman", "Batman", "Bob", "Adam"]
data[:mealType] = ["bug", "worm", "worm", "bug", "worm", "worm", "stick"]
stream = IOBuffer()
println(data)                        # echoes the frame to stdout, uncaptured
print(stream, data)                  # writes the same text into the buffer
yourString = takebuf_string(stream)  # reads the buffer back as a String
returns
"7x2 DataFrame\n| Row | turtle | mealType |\n|-----|----------|----------|\n| 1 | \"Suzy\" | \"bug\" |\n| 2 | \"Suzy\" | \"worm\" |\n| 3 | \"Bob\" | \"worm\" |\n| 4 | \"Batman\" | \"bug\" |\n| 5 | \"Batman\" | \"worm\" |\n| 6 | \"Bob\" | \"worm\" |\n| 7 | \"Adam\" | \"stick\" |"
If you are after formatted strings you can use @sprintf.
julia> x = @sprintf("%s", "hi")
"hi"
julia> x
"hi"
julia> x = @sprintf("%d/%d", 3, 4)
"3/4"
It's a macro, though, so be careful.
