How to extract keys in a nested json array object in Presto? - jsonpath

I'm using the latest (0.117) Presto and trying to execute CROSS JOIN UNNEST with a complex JSON array like this:
[{"id": 1, "value":"xxx"}, {"id":2, "value":"yy"}, ...]
To do that, first I tried to make an ARRAY with the values of id by
SELECT CAST(JSON_EXTRACT('[{"id": 1, "value":"xxx"}, {"id":2, "value":"yy"}]', '$..id') AS ARRAY<BIGINT>)
but it doesn't work.
What is the best JSON Path to extract the values of id?

This will solve your problem. It is a more generic cast to an ARRAY of JSON (less prone to errors given an arbitrary map structure):
select
TRANSFORM(CAST(JSON_PARSE(arr1) AS ARRAY<JSON>),
x -> JSON_EXTRACT_SCALAR(x, '$.id'))
from
(values ('[{"id": 1, "value":"xxx"}, {"id":2, "value":"yy"}]')) t(arr1)
Output in presto:
[1,2]
... I ran into a situation where a list of JSON objects was nested within a JSON document. The objects in my list had an ambiguous nested map structure. The following code returns an array of values for a specific key in a list of JSON objects.
1. Extract the list using JSON_EXTRACT.
2. Cast the list to an array of JSON.
3. Loop through the JSON elements in the array using the TRANSFORM function and extract the value of the key that you are interested in.
TRANSFORM(CAST(JSON_EXTRACT(json, '$.path.toListOfJSONs') AS ARRAY<JSON>),
x -> JSON_EXTRACT_SCALAR(x, '$.id')) as id
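For readers who want to check the three steps outside Presto, here is a rough Python analogue. It is only a sketch: the `path` and `toListOfJSONs` keys come from the placeholder path above, the sample document is made up, and `extract_ids` is a hypothetical helper name, not a Presto function. Unlike JSON_EXTRACT_SCALAR, which returns varchar values, this sketch keeps the JSON types.

```python
import json

def extract_ids(raw_json, key="id"):
    """Mimic the three steps: extract the nested list, treat it as a list of
    JSON objects, then read one key from every element."""
    doc = json.loads(raw_json)
    # Step 1: extract the list (stand-in for JSON_EXTRACT with '$.path.toListOfJSONs').
    items = doc["path"]["toListOfJSONs"]
    # Steps 2 and 3: walk the elements and read the key of interest.
    # Elements without the key yield None, similar to a NULL result in SQL.
    return [item.get(key) if isinstance(item, dict) else None for item in items]

# Made-up sample with an irregular structure per element.
sample = (
    '{"path": {"toListOfJSONs": ['
    '{"id": 1, "value": "xxx"}, '
    '{"id": 2, "extra": {"a": 1}}, '
    '{"value": "no id here"}]}}'
)
ids = extract_ids(sample)
print(ids)
```

The point of going through a list of generic JSON objects is that each element may have a different shape; only the requested key has to be readable.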

You can cast the JSON into an ARRAY of MAP and use the TRANSFORM lambda function to extract the "id" key:
select
TRANSFORM(CAST(JSON_PARSE(arr1) AS ARRAY<MAP<VARCHAR, VARCHAR>>), entry->entry['id'])
from
(values ('[{"id": 1, "value":"xxx"}, {"id":2, "value":"yy"}]')) t(arr1)
output:
[1, 2]

Now you can use presto-third-functions. It provides json_array_extract functions, so you can extract values from a JSON array like this:
select
json_array_extract_scalar(arr1, '$.book.id')
from
(values ('[{"book":{"id":"12"}}, {"book":{"id":"14"}}]')) t(arr1)
output is:
[12, 14]

I finally gave up on finding a simple JSON Path to extract them.
Instead, I wrote a redundant, dirty query like the following to get the task done.
SELECT
...
FROM
(
SELECT
SLICE(ARRAY[
JSON_EXTRACT(json_column, '$[0].id'),
JSON_EXTRACT(json_column, '$[1].id'),
JSON_EXTRACT(json_column, '$[2].id'),
...
], 1, JSON_ARRAY_LENGTH(json_column)) ids
FROM
the.table
) t1
CROSS JOIN UNNEST(ids) AS t2(id)
WHERE
...
I still want to know the best practice if you know another good way to CROSS JOIN them!
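As a way to reason about what the CROSS JOIN UNNEST should produce, the following Python sketch flattens each row against the elements of its JSON array. It is an illustration only: the `name` column and the sample rows are assumptions, and `cross_join_unnest` is a hypothetical helper, not a Presto feature.

```python
import json

def cross_join_unnest(rows, json_field="json_column", key="id"):
    """Yield one output row per (input row, array element) pair."""
    for row in rows:
        for element in json.loads(row[json_field]):
            # Copy the original columns and add the extracted key as a new column.
            yield {**row, key: element[key]}

# Made-up input rows; "name" stands in for any other column of the table.
rows = [
    {"name": "a", "json_column": '[{"id": 1, "value": "xxx"}, {"id": 2, "value": "yy"}]'},
    {"name": "b", "json_column": "[]"},
]
flattened = [(r["name"], r["id"]) for r in cross_join_unnest(rows)]
print(flattened)
```

A row whose array is empty contributes no output rows, which matches how a cross join against an empty unnested array behaves.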

Related

Different array types, for comparison in the set_difference() function

I'm having some issues getting expected results from set_difference(). I assumed I was comparing two dynamic arrays, but I'm not sure where the gap is. The only additional insight I have is that when I compare the two arrays using the gettype() function, I get the following:
First array
Created using a make_list aggregation, e.g.
| summarize inv_list = make_list(Date)
When I run gettype() on the array:
"type_inv_list": array
Second array
Created through a scalar function
let period_check_range = todynamic(range(make_datetime(start_date), datetime_add('day',8,make_datetime(start_date)),1d));
When I run gettype() on the array:
"type_range___scalar_90e56a216d8942f28e6797e5abc35dd9": array
Any guidance on how to make these arrays work so I can use the set_difference() function?
You're missing toscalar() (see doc) for the first array. When you run | summarize ... you get a table as a result, but what you actually want is a single scalar; that's why toscalar() is needed.
Here's how to achieve what you want:
let StartDate = ago(10d);
let Array1 = toscalar(MyTable | summarize make_set(Timestamp));
let Array2 = todynamic(range(make_datetime(StartDate), datetime_add('day',8,make_datetime(StartDate)),1d));
print set_difference(Array2, Array1)
By the way, you probably want to use make_set and not make_list as you're not interested in duplicate values.
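The same comparison can be sketched in Python to show what the query is computing. This is a rough analogue, not Kusto: the start date and the observed dates are made-up values, and plain Python sets stand in for the dynamic arrays.

```python
from datetime import datetime, timedelta

start_date = datetime(2024, 1, 1)  # assumed start date, for illustration only

# Counterpart of range(start, start + 8 days, 1d): daily values with both ends included.
period_check_range = [start_date + timedelta(days=n) for n in range(9)]

# Counterpart of toscalar(... | summarize make_set(...)): a single set value,
# not a one-row table. The duplicate collapses, as it would with make_set.
inv_list = {datetime(2024, 1, 1), datetime(2024, 1, 3), datetime(2024, 1, 3)}

# Counterpart of set_difference(Array2, Array1): expected dates that never appeared.
missing = sorted(set(period_check_range) - inv_list)
print(len(missing))
```

Both operands are ordinary collections here, which is the situation toscalar() creates on the Kusto side.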

searching in multiple collections joined by common fields in xquery marklogic

I have two collections ('A' and 'B') with millions of transport insurance data documents. The two collections have four elements in common (customer-no, date-of-insurance, insurance-no, accident-number), and one element (license-no) exists only in collection 'A'. I want to extract all the documents that are present in both collections and also have the element that exists only in collection 'A'. I am able to retrieve all the customer-nos from 'A' with cts:search. Then I loop through each of these customer-nos to look for license-no in 'A'. This gives an empty sequence, which I know cannot be right. Could someone guide me with appropriate search logic?
let $col-A := cts:search(
doc(),
cts:and-query((
cts:collection-query('col-A'),
cts:element-value-query(xs:QName('abc:Acusno'), '*', (("wildcarded")))
)))
for $each in $col-A
let $col-B := cts:search(doc(),
cts:and-query((cts:collection-query('col-B'),
cts:element-value-query(xs:QName('abc:Bcusno'), $each)
)))
return $col-B
returns empty sequence
Your first cts:search is returning entire documents, which you are then passing in as argument into the value-query. You probably want to pass in just the value of abc:Acusno. You could do that with something like $each//abc:Acusno.
Your code is not using a very efficient approach though, and what if certain Acusno values occur multiple times?
I would recommend putting a range index on abc:Acusno, and using cts:values to pull up the unique values that match a given query. Then feed that entire list as one argument without any looping to a query against abc:Bcusno. You don't have to use a range index, and range query on Bcusno, but it could be useful to have that index anyhow. The code would then look something like this:
let $query :=
cts:and-query((
cts:collection-query('col-A'),
cts:element-query(xs:QName('abc:Acusno'), cts:true-query())
))
let $customerNrs :=
cts:values(
cts:element-reference(xs:QName("abc:Acusno")),
(),
(),
$query
)
return cts:search(
collection(),
cts:and-query((
cts:collection-query('col-B'),
cts:element-range-query(xs:QName('abc:Bcusno'), '=', $customerNrs)
))
)
Note: be careful when returning full search lists like this. You might want to paginate the response.
HTH!
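The shape of that approach (collect the distinct values once, then run a single membership query) can be illustrated with a small Python sketch. Everything in it is an assumption made for illustration: the in-memory lists stand in for the two collections, and the field values are invented. It does not use any MarkLogic API.

```python
# Hypothetical in-memory stand-ins for the two collections.
col_a = [
    {"Acusno": "C1", "license-no": "L-100"},
    {"Acusno": "C1", "license-no": "L-101"},  # same customer number twice
    {"Acusno": "C2", "license-no": "L-200"},
]
col_b = [
    {"Bcusno": "C1", "insurance-no": "I-1"},
    {"Bcusno": "C3", "insurance-no": "I-3"},
]

# Counterpart of cts:values: the distinct customer numbers present in collection A.
customer_nrs = {doc["Acusno"] for doc in col_a if "Acusno" in doc}

# Counterpart of the single range query: one pass over B with the whole value set,
# instead of one search per customer number.
matches = [doc for doc in col_b if doc["Bcusno"] in customer_nrs]
print(matches)
```

Because the values are de-duplicated first, a customer number that occurs many times in 'A' still leads to only one lookup.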

Using Rascal MAP

I am trying to create an empty map, that will be then populated within a for loop. Not sure how to proceed in Rascal. For testing purpose, I tried:
rascal>map[int, list[int]] x;
ok
Though, when I try to populate "x" using:
rascal>x += (1, [1,2,3])
>>>>>>>;
>>>>>>>;
^ Parse error here
I got a parse error.
To start, it would be best to assign it an initial value. You don't have to do this at the console, but this is required if you declare the variable inside a script. Also, if you are going to use +=, it has to already have an assigned value.
rascal>map[int,list[int]] x = ( );
map[int, list[int]]: ()
Then, when you are adding items into the map, the key and the value are separated by a :, not by a ,, so you want something like this instead:
rascal>x += ( 1 : [1,2,3]);
map[int, list[int]]: (1:[1,2,3])
rascal>x[1];
list[int]: [1,2,3]
An easier way to do this is to use similar notation to the lookup shown just above:
rascal>x[1] = [1,2,3];
map[int, list[int]]: (1:[1,2,3])
Generally, if you are just setting the value for one key, or are assigning keys inside a loop, x[key] = value is better; += is better if you are adding two existing maps together and saving the result into one of them.
I also like this solution sometimes, where you instead of joining maps just update the value of a certain key:
m = ();
for (...whatever...) {
m[key]?[] += [1,2,3];
}
In this code, when the key is not yet present in the map, then it starts with the [] empty list and then concatenates [1,2,3] to it, or if the key is present already, let's say it's already at [1,2,3], then this will create [1,2,3,1,2,3] at the specific key in the map.
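For comparison, the same "start from an empty list if the key is missing, then append" behaviour looks like this in Python. It is only an analogue of the Rascal idiom; the keys are made up.

```python
m = {}
for key in ["a", "b", "a"]:
    # Use [] when the key is not present yet, then concatenate [1, 2, 3] to it.
    m[key] = m.get(key, []) + [1, 2, 3]
print(m)
```

After the loop, the key seen twice holds the list concatenated twice, and the key seen once holds it once.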

DynamoDB nested attribute querying support

Does Amazon DynamoDB scan operation allow you to query on nested attributes of type Array or Object? For example,
{
Id: 206,
Title: "20-Bicycle 206",
Description: "206 description",
RelatedItems: [
341,
472,
649
],
Pictures: {
FrontView: "123",
RearView: "456",
SideView: "789"
}
}
Can I query on RelatedItems[2] or Pictures.RearView attributes?
Yes, you can use a Filter Expression, which is just like Condition Expression. The section that talks about the functions that you can use in these types of expressions mentions the following:
"For a nested attribute, you must provide its full path; for more information, see Document Paths."
The Document Paths reference has examples on how to reference nested attributes in DynamoDB data types like List (what you are calling an array) and Map (what you are calling an object). Check out that reference for examples on how to do so:
MyList[0]
AnotherList[12]
ThisList[5][11]
MyMap.nestedField
MyMap.nestedField.deeplyNestedField
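To make the path notation concrete, here is a small local sketch that resolves such a path against a plain Python dict shaped like the item in the question. `resolve_path` is a hypothetical helper written for this illustration; it is not part of DynamoDB or boto3 and ignores details such as reserved words and attribute names that contain dots.

```python
import re

def resolve_path(item, path):
    """Follow a document path such as 'Pictures.RearView' or 'RelatedItems[2]'."""
    current = item
    # Each token is either a map key (a name) or a zero-based list index written as [n].
    for name, index in re.findall(r"([^.\[\]]+)|\[(\d+)\]", path):
        current = current[name] if name else current[int(index)]
    return current

item = {
    "Id": 206,
    "RelatedItems": [341, 472, 649],
    "Pictures": {"FrontView": "123", "RearView": "456", "SideView": "789"},
}
third_related = resolve_path(item, "RelatedItems[2]")
rear_view = resolve_path(item, "Pictures.RearView")
print(third_related, rear_view)
```

List indexes are zero-based, so RelatedItems[2] refers to the third element.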
Please note that in DynamoDB, query and scan are quite different (scan is a much costlier operation). So while you can filter on both, as pointed out by @coffeeplease, you can only query/index on:
The key schema for the index. Every attribute in the index key schema must be a top-level attribute of type String, Number, or Binary. Other data types, including documents and sets, are not allowed (ref).
Yes, you can, by passing a list or a value.
from boto3.dynamodb.conditions import Attr
data = table.scan(FilterExpression=Attr('RelatedItems').contains([1, 2, 3]) & Attr('Pictures.RearView').eq('1'))
Yes, you can query on nested attributes of type array or object using scan or query.
Reference for Python boto3:
https://boto3.amazonaws.com/v1/documentation/api/latest/guide/dynamodb.html#querying-and-scanning
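Example: Suppose you want to find the records for which Pictures.RearView > 500 and the second item of RelatedItems > 200. RelatedItems holds numbers in the sample item, so it is compared with a number; Pictures.RearView holds strings, so it is compared with a string. You can do the following:
from boto3.dynamodb.conditions import Attr
data = table.scan(
FilterExpression=Attr('RelatedItems[1]').gt(200) & Attr('Pictures.RearView').gt('500'))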

How to make a variable+wildcard dict:fetch in Erlang

I have the following structure of a dictionary in Erlang:
Key: {element_name, a, element_type, type_1}
Value: [list].
Dictionary: (({element_name, a, element_type, type_1},[List]), ({element_name, b, element_type, type_2},[List])).
I would like to update a certain key-value pair and insert some new data into the 'key' tuple (not into 'value' list):
1. Value_list = dict:fetch({element_name, a, element_type, _}, Dict).
2. Dict2 = dict:erase ({element_name, a, element_type, _}, Dict).
3. Dict3 = dict:store ({element_name, a, element_type, New_type}, Value_list, Dict2).
The problem is that at line 1 Erlang says that variable "_" is unbound.
It seems that I cannot fetch a value by providing only a part of the key if the key is a tuple. Is this true?
Is it actually possible to update a key in a dictionary?
Is there any shorter way to do this instead of doing 1,2 and 3?
dict doesn't support what you want to do. You will have to know the key, erase the old key/value pair, and store a new one.
Take a look at ets. You can use ets:match to find keys that match your spec. You'll still have to delete the old key/value pair and insert a new one.
If you insist on updating the key in the dictionary without deleting it and later storing a new value against it, I suggest that you first convert your Dict into a list with dict:to_list/1. Consider this piece of code:
Fun_to_match_key = fun({{element_name, a, element_type, _} = Key,Value})->
%% do some stuff here with the Key and value and assuming
%% this fun returns the new Key-Value pair you want
New_Key = update_my_key(Key),
New_Value = update_my_value_if_need_to(Value),
{New_Key,New_Value};
(Any)-> Any
end,
%% Then in one operation, you convert the dict into a list, apply the
%% fun above in a list comprehension and convert the list back to a dict
New_dict = dict:from_list([Fun_to_match_key(Key_Value_Pair) || Key_Value_Pair <- dict:to_list(Old_Dict)]),
New_dict.
Converting the dict into a list will give you a proplist(), which makes it much easier to manipulate either the key or the value. You could use any method, say recursion with several clauses in which you pattern match the kind of key you want to manipulate; in the above example I have chosen to use a fun within a list comprehension.
That should do the trick!
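The idea of rebuilding the whole dictionary while rewriting only the keys that match a partial pattern is language-independent. Below is a rough Python analogue, offered as a sketch rather than as Erlang code: tuples stand in for the Erlang key tuples, strings stand in for atoms, and `retype` is a hypothetical helper name.

```python
def retype(old_dict, name, new_type):
    """Return a new dict in which every key for `name` gets `new_type`.
    The first three positions of the key act as the match pattern; the
    fourth position is the wildcard that gets replaced."""
    new_dict = {}
    for key, value in old_dict.items():
        if key[:3] == ("element_name", name, "element_type"):
            key = ("element_name", name, "element_type", new_type)
        new_dict[key] = value  # non-matching pairs are copied unchanged
    return new_dict

old = {
    ("element_name", "a", "element_type", "type_1"): [1, 2],
    ("element_name", "b", "element_type", "type_2"): [3],
}
new = retype(old, "a", "type_9")
print(new)
```

As in the Erlang version, the original dictionary is left untouched and a new one is returned.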

Resources