Why does "reduce" not reduce in jq - jq

tl;dr
In the language of jq, why is
$ jq --compact-output --null-input 'reduce (1,2,3,4) as $i ([]; . + [$i])'
[1,2,3,4]
not the same as
$ jq --compact-output --null-input '(1,2,3,4) | reduce . as $i ([]; . + [$i])'
[1]
[2]
[3]
[4]
Full question and discussion
I have a somewhat theoretical question in that I have figured out a way to get the transformation I want, but still I do not understand completely why my first attempt failed and I would like an explanation.
I have input
{
  "data": {
    "k1": "v1"
  },
  "item": {
    "data": {
      "k2": "v2"
    }
  },
  "list": {
    "item": {
      "data": {
        "k3": "v3",
        "k4": "v4"
      }
    }
  }
}
and I want to collect into a single array all of the values of all of the keys that are immediate children of a "data" key. So the output I want is
["v1","v2","v3","v4"]
I eventually figured out that this works
jq --compact-output '[.. | .data? | select(.) | to_entries | .[].value]'
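Fed the input above (compacted here onto one line for the shell), the filter can be checked directly; this invocation is just a sketch:

```shell
echo '{"data":{"k1":"v1"},"item":{"data":{"k2":"v2"}},"list":{"item":{"data":{"k3":"v3","k4":"v4"}}}}' |
  jq -c '[.. | .data? | select(.) | to_entries | .[].value]'
# → ["v1","v2","v3","v4"]
```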
My question is, why could I not get it to work with reduce? I originally tried
.. | .data? | select(.) | to_entries | .[].value | reduce . as $v ([]; . + [$v])
but that gave me
["v1"]
["v2"]
["v3"]
["v4"]
instead. My question is why? reduce is supposed to iterate over multiple values, but what kind of multiple values does it iterate over, and what kind are treated as separate inputs to separate invocations of reduce?
I guess my fundamental confusion is: when is . (dot) one expression with 4 results, and when is it 4 expressions? Or, if . is always an expression with 1 result, how do you collect 4 results back into 1, which is what reduce is all about? Is the array constructor the only way?

An expression of the form:
reduce STREAM as ...
reduces the given stream, whereas the compound expression:
STREAM | reduce . as ...
invokes reduce once for each item in the stream, and for each invocation, . is that item.
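The difference can be checked directly with jq's -n/--null-input flag (a sketch):

```shell
# One reduce over the whole stream: a single invocation, a single result.
jq -cn 'reduce (1,2,3,4) as $i ([]; . + [$i])'
# → [1,2,3,4]

# Piping the stream into reduce: one invocation per item, four results.
jq -cn '(1,2,3,4) | reduce . as $i ([]; . + [$i])'
# → [1], [2], [3], [4] on separate lines
```

Wrapping the stream in the array constructor, [ (1,2,3,4) ], is the other standard way to collect a stream into a single value.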
If the concept of streams is unclear in this context, you might be interested to read a stream-oriented introduction to jq that I wrote:
https://github.com/pkoppstein/jq/wiki/A-Stream-oriented-Introduction-to-jq

Related

How can I delete all keys that don't match certain names with JQ?

I have a huge JSON file with lots of stuff I don't care about, and I want to filter it down to only the few keys I care about, preserving the structure. I don't mind if the same key name occurs in different paths and I get both of them. I gleaned something very close from the answers to this question; it taught me how to delete all properties with certain values, like all null values:
del(..|nulls)
or, more powerfully
del(..|select(. == null))
I searched high and low for a way to write a predicate over the name of a property while I am looking at that property. I come from XSLT, where I could write something like this:
del(..|select(name|test("^(foo|bar)$")))
where name/1 would be the function that returns the property name or array index number where the current value comes from. But it seems that jq lacks the metadata on its values, so you can only write predicates about their value, and perhaps the type of their value (that's still just a feature of the value), but you cannot inspect the name, or path leading up to it?
I tried to use paths and leaf_paths and the like, and tested them out to see how this path machinery works, but they seem to find the child paths inside an object, not the path leading up to the present value.
So how could this be done, delete everything but a set of key values? I might have found a way here:
walk(
  if type == "object" then
    with_entries(
      select( ( .key | test("^(foo|bar|...)$") )
        and ( .value != "" )
        and ( .value != null ) )
    )
  else
    .
  end
)
OK, this seems to work. But I still think it would be so much easier if we had a way of querying the current property name, array index, or path leading up to the item being inspected with the simple ..| recursion form.
In analogy to your approach using .. and del, you could use paths and delpaths to operate on a stream of path arrays, and delete a given path if not all of its elements meet your conditions.
delpaths([paths | select(all(IN("foo", "bar") or type == "number") | not)])
For the condition I used IN("foo", "bar") but (type == "string" and test("^(foo|bar)$")) would work as well. To also retain array elements (which have numeric indices), I added or type == "number".
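On a small made-up input (the key names foo, bar, baz, qux here are only illustrative), the path-based deletion behaves like this:

```shell
echo '{"foo":1,"baz":2,"bar":{"qux":3}}' |
  jq -c 'delpaths([paths | select(all(IN("foo", "bar") or type == "number") | not)])'
# → {"foo":1,"bar":{}}
```

bar survives as an empty object because the path ["bar"] itself satisfies the condition; only the deeper path ["bar","qux"] is deleted.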
Unlike in XML, there's no concept of attributes in JSON. You'll need to delete keys from objects.
To delete an element of an object, you need to use del( obj[ key ] ) (or use with_entries). You can get a stream of the keys of an object using keys[]/keys_unsorted[] and filter out the ones you don't want to delete.
Finally, you need to invert the result of test because you want to delete those that don't match.
After fixing these problems, we get the following:
INDEX( "foo", "bar" ) as $keep |
del(
.. | objects |
.[
keys_unsorted[] |
select( $keep[ . ] | not )
]
)
Note that I substituted the regex match with a dictionary lookup. You could use test( "^(?:foo|bar)\\z" ) in lieu of $keep[ . ], but a dictionary lookup should be faster than a regex match. And it should be less error-prone too, considering you misused $ and (...) in lieu of \z and (?:...).
The above visits deleted branches for nothing. We can avoid that by using walk instead of ...
INDEX( "foo", "bar" ) as $keep |
walk(
if type == "object" then
del(
.[
keys_unsorted[] |
select( $keep[ . ] | not )
]
)
else
.
end
)
Since I mentioned one could use with_entries instead of del, I'll demonstrate.
INDEX( "foo", "bar" ) as $keep |
walk(
if type == "object" then
with_entries( select( $keep[ .key ] ) )
else
.
end
)
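As a sketch on a made-up input (the keys are only illustrative), the with_entries variant behaves like this; note that INDEX is given the key names as a stream and . as the index expression:

```shell
echo '{"foo":1,"baz":{"foo":2,"qux":3}}' |
  jq -c 'INDEX("foo", "bar"; .) as $keep
         | walk(if type == "object" then with_entries(select($keep[.key])) else . end)'
# → {"foo":1}
```

Since baz is dropped at the top level, its inner foo disappears with it.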
Here's a solution that uses a specialized variant of walk for efficiency (*). It retains objects even when all of their keys are removed; only trivial changes are needed if a blacklist or some other criterion (e.g., regexp-based) is given instead. WHITELIST should be a JSON array of the key names to be retained.
jq --argjson whitelist WHITELIST '
  def retainKeys($array):
    INDEX($array[]; .) as $keys
    | def r:
        if type == "object"
        then with_entries( select($keys[.key]) )
           | map_values( r )
        elif type == "array" then map( r )
        else .
        end;
      r;
  retainKeys($whitelist)
' input.json
(*) Note for example:
the use of INDEX
the recursive function, r, has arity 0
for objects, the top-level deletion occurs first.
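A sketch of an invocation, with a made-up whitelist and the input inlined via echo in place of input.json:

```shell
echo '{"foo":1,"bar":[{"foo":2,"baz":3}],"qux":4}' |
  jq -c --argjson whitelist '["foo","bar"]' '
    def retainKeys($array):
      INDEX($array[]; .) as $keys
      | def r:
          if type == "object"
          then with_entries( select($keys[.key]) ) | map_values( r )
          elif type == "array" then map( r )
          else .
          end;
        r;
    retainKeys($whitelist)'
# → {"foo":1,"bar":[{"foo":2}]}
```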
Here's a space-efficient, walk-free approach, tailored for the case of a WHITELIST. It uses the so-called "streaming" parser, so the invocation would look like this:
jq -n --stream --argjson whitelist WHITELIST -f program.jq input.json
where WHITELIST is a JSON array of the names of the keys to be retained, and
where program.jq is a file containing the program:
# Input: an array
# Output: the longest head of the array that includes only numbers or items in the dictionary
def acceptable($dict):
  last(label $out
    | foreach .[] as $x ([];
        if ($x|type == "number") or $dict[$x] then . + [$x]
        else ., break $out
        end));

INDEX( $whitelist[]; .) as $dict
| fromstream(inputs
    | if length == 2
      then (.[0] | acceptable($dict)) as $p
           | if ($p|length) == (.[0]|length) - 1 then .[0] = $p | .[1] = {}
             elif ($p|length) < (.[0]|length) then empty
             else .
             end
      else .
      end )
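To see acceptable in isolation (a sketch with a hand-built dictionary and path array):

```shell
jq -cn '
  def acceptable($dict):
    last(label $out
      | foreach .[] as $x ([];
          if ($x|type == "number") or $dict[$x] then . + [$x]
          else ., break $out
          end));
  ["items", 0, "files", "name"] | acceptable({"items": true, "name": true})'
# → ["items",0]
```

The head stops just before "files", the first element that is neither a number nor a key in the dictionary.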
Note: The reason this is relatively complicated is that it assumes that you want to retain objects all of whose keys have been removed, as illustrated in the following example. If that is not the case, then the required jq program is much simpler.
Example:
WHITELIST: '["items", "config", "spec", "setting2", "name"]'
input.json:
{
  "items": [
    {
      "name": "issue1",
      "spec": {
        "config": {
          "setting1": "abc",
          "setting2": {
            "name": "xyz"
          }
        },
        "files": {
          "name": "cde",
          "path": "/home"
        },
        "program": {
          "name": "apache"
        }
      }
    },
    {
      "name": {
        "etc": 0
      }
    }
  ]
}
Output:
{
  "items": [
    {
      "name": "issue1",
      "spec": {
        "config": {
          "setting2": {
            "name": "xyz"
          }
        }
      }
    },
    {
      "name": {}
    }
  ]
}
I am going to put my own tentative answer here.
The thing is, I already had a solution in my question: I can select keys during forward navigation, but I cannot find out the path leading up to the present value.
I looked around in the source code of jq to see why we cannot query the path leading up to the present value, so that we could ask for the key string or array index of the present value. And indeed it looks like jq does not track the path while it walks through the input structure.
I think this is a huge forfeited opportunity, since the path could so easily be tracked during the tree walk.
This is why I continue to think that XML with XSLT and XPath is a much more robust data representation and tool chain than JSON. In fact, I find JSON even harder to read than XML. The benefit of JSON being so close to JavaScript is really only relevant if, as I do in some cases, I read the JSON as JavaScript source code assigned to a variable, and then instrument it by changing the prototype of the anonymous JSON object so that I have methods to go with it. Changing the prototype is said to cause slowness, though I don't think it does when setting it on otherwise anonymous JSON objects.
There is JsonPath, which tries (by way of the name) to be for JSON what XPath is for XML. But it is a poor substitute and also has no way to navigate the parent (or sibling) axes.
So, in summary, while selecting by key in white or black lists is possible in principle, it is quite hard, because an easy-to-have feature of a JSON navigation language is neither specified nor implemented. Another useful feature that could easily be added to jq is backward navigation to the parent or an ancestor of the present value. Currently, if you want to navigate back, you need to capture the ancestor you want to return to in a variable. It is possible, but jq could be massively improved by keeping track of ancestors and paths.

How can one filter a JSON object to select only specific key/values using jq?

I'm trying to validate all versions in a versions.json file, and get as output a JSON object containing only the invalid versions.
Here's a sample file:
{
  "slamx": "16.4.0 ",
  "sdbe": null,
  "mimir": null,
  "thoth": null,
  "quasar": null,
  "connectors": {
    "s3": "16.0.17",
    "azure": "6.0.17",
    "url": "8.0.2",
    "mongo": "7.0.15"
  }
}
I can use the following jq script line to do what I want:
delpaths([paths(type == "string" and contains(" ") or type == "object" | not)])
| delpaths([paths(type == "object" and (to_entries | length == 0))])
And use it on a shell like this:
BAD_VERSIONS=$(jq 'delpaths([paths(type == "string" and contains(" ") or type == "object" | not)]) | delpaths([paths(type == "object" and (to_entries | length == 0))])' versions.json)
if [[ $BAD_VERSIONS != "{}" ]]; then
  echo >&2 $'Bad versions detected in versions.json:\n'"$BAD_VERSIONS"
  exit 1
fi
and get this as the output:
Bad versions detected in versions.json:
{
  "slamx": "16.4.0 "
}
However, that's a very convoluted way of doing the filtering. Instead of just walking the paths tree and just saying "keep this, keep that", I need to create a list of things I do not want and remove them, twice.
Given all the path-handling builtins and recursive processing, I can't help but feel that there has to be a better way of doing this: something akin to select, but working recursively across the object. The best I could do was this:
. as $input |
[path(recurse(.[]?)|select(strings|contains("16")))] as $paths |
reduce $paths[] as $x ({}; . | setpath($x; ($input | getpath($x))))
I don't like that for two reasons. First, I'm creating a new object instead of "editing" the old one. Second and foremost, it's full of variables, which points to a severe flow inversion issue, and adds to the complexity.
Any ideas?
Thanks to #jhnc's comment, I found a solution. The trick was using streams, which makes nesting irrelevant -- I can apply filters based solely on the value, and the objects will be recomposed given the key paths.
The first thing I tried did not work, however. This:
jq -c 'tostream|select(.[-1] | type=="string" and contains(" "))' versions.json
returns [["slamx"],"16.4.0 "], which is what I'm searching for. However, I could not fold it back into an object. For that to happen, the stream has to have the "close object" markers -- arrays with just one element, corresponding to the last key of the object being closed. So I changed it to this:
jq -c 'tostream|select((.[-1] | type=="string" and contains(" ")) or length==1)' versions.json
Breaking it down, .[-1] selects the last element of the array, which will be the value. Next, type=="string" and contains(" ") will select all values which are strings and contain spaces. The last part of the select, length==1, keeps all the "end" markers. Interestingly, it works even if the end marker does not correspond to the last key, so this might be brittle.
With that done, I can de-stream it:
jq -c 'fromstream(tostream|select((.[-1] | type=="string" and contains(" ")) or length==1))' versions.json
The jq expression is as follow:
fromstream(
  tostream |
  select(
    (
      .[-1] |
      type == "string" and contains(" ")
    ) or
    length == 1
  )
)
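To see the stream events this select operates on, including the length-1 closing markers, run tostream on a pared-down version of the input:

```shell
echo '{"slamx":"16.4.0 ","connectors":{"s3":"16.0.17"}}' | jq -c tostream
# [["slamx"],"16.4.0 "]
# [["connectors","s3"],"16.0.17"]
# [["connectors","s3"]]
# [["connectors"]]
```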
For objects, the test to_entries|length == 0 can be abbreviated to length==0.
If I understand the goal correctly, you could just use .., perhaps along the following lines:
..
| objects
| with_entries(
    select( (.value|type == "string" and contains(" "))
         or (.value|type == "object" and length==0) ) )
| select(length>0)
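On a trimmed version of the sample input, this produces (a sketch):

```shell
echo '{"slamx":"16.4.0 ","sdbe":null,"connectors":{"s3":"16.0.17"}}' |
  jq -c '..
         | objects
         | with_entries(
             select((.value|type == "string" and contains(" "))
                 or (.value|type == "object" and length==0)))
         | select(length>0)'
# → {"slamx":"16.4.0 "}
```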
paths
If you want the paths, then consider:
([], paths) as $p
| getpath($p)
| objects
| with_entries(
    select( (.value|type == "string" and contains(" "))
         or (.value|type == "object" and length==0) ) )
| select(length>0) as $x
| {} | setpath($p; $x)
With your input modified so that s3 has a trailing blank, the above produces:
{"slamx":"16.4.0 "}
{"connectors":{"s3":"16.0.17 "}}

Running jq query on alternating object type

I have json file with lots of data about image files. It has this structure:
[
  {
    "id": 1,
    "graphic": "filename",
    "export_params": {
      "uses": [
        "string"
      ]
    }
  },
  {
    "id": 2,
    "graphic": "filename2",
    "export_params": []
  },
  ...
]
Most objects in this array have full export_params info, but sometimes it is just an empty array. I've tried using this jq query
.[] | [.id, .graphic, .export_params.uses[], .export_params.export_type ] | @csv
to turn it into CSV, but it broke on the line where it found the first empty export_params key. How can I get around the problem of the differing types (in most cases export_params is an object, but when empty it's an array; I think this is what causes my query to fail)?
The easy part of this question is handling empty arrays and missing "export_type" values, e.g.
.[]
| [ .id,
    .graphic,
    (.export_params.uses?[] // ""),
    (.export_params.export_type? // "") ]
| @csv
But what if .uses is an array with more than one item in it? That would potentially mean a variable number of values
in the rows, which might cause problems.
To restrict consideration to the first item in .uses, you could use first:
.[]
| [ .id,
    .graphic,
    first(.export_params.uses?[] // ""),
    (.export_params.export_type? // "") ]
| @csv
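A sketch with a two-element input (the file names are made up), showing that both rows now render:

```shell
echo '[{"id":1,"graphic":"f1","export_params":{"uses":["web"],"export_type":"png"}},
       {"id":2,"graphic":"f2","export_params":[]}]' |
  jq -r '.[]
         | [.id,
            .graphic,
            first(.export_params.uses?[] // ""),
            (.export_params.export_type? // "")]
         | @csv'
# 1,"f1","web","png"
# 2,"f2","",""
```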
An alternative approach
To avoid clutter, it might be preferable to tweak the objects before querying them, e.g. along these lines:
.[]
| .export_params |= (if . == [] then {uses: [""]} else . end)
| [ .id,
    .graphic,
    .export_params.uses[0],
    .export_params.export_type ]
| @csv

What is it called when you have multiple data structures, but not connected with json in jq?

For instance, I might have something coming out of my jq command like this:
"some string"
"some thing"
"some ping"
...
Note that there is no outer object or array and no commas between items.
Or you might have something like:
["some string"
"some thing"
"some ping"]
["some wing"
"some bling"
"some fing"]
But again, there is no outer object or array, and there are no commas between the items to indicate that this is JSON.
I keep thinking the answer is that it is called "raw", but I'm uncertain about this.
I'm specifically looking for a term to search for in the documentation that covers processing the sorts of examples above; I am at a loss as to how to proceed.
To start with, the jq manual.yml describes the behavior of filters this way:
Some filters produce multiple results, for instance there's one that
produces all the elements of its input array. Piping that filter
into a second runs the second filter for each element of the
array. Generally, things that would be done with loops and iteration
in other languages are just done by gluing filters together in jq.
It's important to remember that every filter has an input and an
output. Even literals like "hello" or 42 are filters - they take an
input but always produce the same literal as output. Operations that
combine two filters, like addition, generally feed the same input to
both and combine the results. So, you can implement an averaging
filter as add / length - feeding the input array both to the add
filter and the length filter and then performing the division.
It's also important to keep in mind that the default behavior of jq is to run the filter you specify once for each input JSON value. In the following example, jq runs the identity filter four times, passing one value to it each time:
$ (echo 2;echo {}; echo []; echo 3) | jq .
2
{}
[]
3
What is happening here is similar to
$ jq -n '2, {}, [], 3 | .'
2
{}
[]
3
Since this isn't always what you want, the -s option can be used to tell jq to gather the separate values into an array and feed that to the filter:
$ (echo 2;echo {}; echo []; echo 3)| jq -s .
[
2,
{},
[],
3
]
which is similar to
$ jq -n '[2, {}, [], 3] | .'
[
2,
{},
[],
3
]
The jq manual.yml explains how the --raw-input/-R option can be included for even more control over input handling:
Don't parse the input as JSON. Instead, each line of text is passed to the filter as a string. If combined with --slurp,then the entire input is passed to the filter as a single long string.
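For example, with --raw-input alone, each line of text arrives at the filter as a separate JSON string:

```shell
printf 'alpha\nbeta\n' | jq -R .
# "alpha"
# "beta"
```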
You can see using the -s and -R options together in this example produces a different result:
$ (echo 2;echo {}; echo []; echo 3)| jq -s -R .
"2\n{}\n[]\n3\n"

Reducing JSON with jq

I have a JSON array of Objects:
[{key1: value},{key2:value}, ...]
I would like to reduce these into the following structure:
{key1: value, key2: value, ...}
Is this possible to do with jq?
I was trying:
cat myjson.json | jq '.[] | {(.key): value}'
This doesn't quite work as it iterates over each datum rather than reducing it to one Object.
Note that jq has a builtin function called 'add' that does the same thing that the first answer suggests, so you ought to be able to write:
jq add myjson.json
To expand on the other two answers a bit, you can "add" two objects together like this:
.[0] + .[1]
=> { "key1": "value", "key2": "value" }
You can use the generic reduce function to repeatedly apply a function between the first two items of a list, then between that result and the next item, and so on:
reduce .[] as $item ({}; . + $item)
We start with {}, add .[0], then add .[1] etc.
Finally, as a convenience, jq has an add function which is essentially an alias for exactly this function, so you can write the whole thing as:
add
Or, as a complete command line:
jq add myjson.json
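With input shaped like the question's (keys and values made up), this looks like:

```shell
echo '[{"key1":"a"},{"key2":"b"}]' | jq -c add
# → {"key1":"a","key2":"b"}
```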
I believe the following will work:
cat myjson.json | jq 'reduce .[] as $item ({}; . + $item)'
It takes each item in the array, and adds it to the sum of all the previous items.
