How can I Extract a Large json Array in U-SQL and Include meta-data? - json.net

I am trying to extract a large amount of Data from a json document. There are about 1500 nodes per json document. When I attempt to load the body node I get the 128KB limit error. I have found a way to load the node but I have to go all the way down to the array list. JsonExtractor("body.nprobe.items[*]"); The issue I am having is that I cannot access any other part of the json document, I need to get the meta data, like: Id, SerialNumber etc. Should the json file be changed in some way? The data I need is 3 levels down. The json has be obfuscated and shortened, the real file is about 33K lines of formatted json about 1500 items with n-20 fields in each.
{
"headers": {
"pin": "12345",
"Type": "URJ201W-GNZGK",
"RTicks": "3345",
"SD": "211",
"Jov": "juju",
"Market": "Dal",
"Drst": "derre",
"Model": "qw22",
"DNum": "de34",
"API": "34f",
"Id": "821402444150002501"
},
"Id": "db5aacae3778",
"ModelType": "URJ",
"body": {
"uHeader": {
"ID": "821402444150002501",
"SerialNo": "ee028861",
"ServerName": "BRXTTY123"
},
"header": {
"form": 4,
"app": 0,
"Flg": 1,
"Version": 11056,
"uploadID": 1,
"uDate": "2016-04-14T18:29"
},
"nprobe": {
"items": [{
"purchaseDate": "2016-04-14T18:21:09",
"storeLoc": {
"latitude": 135.052335,
"longitude": 77.167005
},
"sr": {
"ticks": 3822,
"SkuId": "24",
"Data": {
"X": "0.00068",
"Y": "0.07246",
}
}
},
{
"purchaseDate": "2016-04-14T18:21:09",
"storeLoc": {
"latitude": 135.052335,
"longitude": 77.167005
},
"sr": {
"ticks": 3823,
"SkuId": "25",
"Data": {
"X": "0",
"Y": "2",
}
}
}]
}
},
"Messages": []
}
Thanks.

You'll have to use CROSS APPLY:
https://msdn.microsoft.com/en-us/library/azure/mt621307.aspx
and EXPLODE:
https://msdn.microsoft.com/en-us/library/azure/mt621306.aspx
See a worked out solution here:
https://github.com/algattik/USQLHackathon/blob/master/VS-Solution/USQLApplication/ciam-to-sqldw.usql
https://github.com/algattik/USQLHackathon/blob/master/Samples/Customer/customers%202016-08-10.json
-- IMPROVED ANSWER: --
As this solution will not work for you since your inner JSON is too large to fit in a string, you can parse the input twice:
DECLARE #input string = #"/so.json";
REFERENCE ASSEMBLY [Newtonsoft.Json];
REFERENCE ASSEMBLY [Microsoft.Analytics.Samples.Formats];
#meta =
EXTRACT Id string
FROM #input
USING new Microsoft.Analytics.Samples.Formats.Json.JsonExtractor();
#items =
EXTRACT purchaseDate string
FROM #input
USING new Microsoft.Analytics.Samples.Formats.Json.JsonExtractor("body.nprobe.items[*]");
#itemsFull =
SELECT Id,
purchaseDate
FROM #meta
CROSS JOIN #items;
OUTPUT #itemsFull
TO "/items_full.csv"
USING Outputters.Csv();

Related

How to convert json array in to the columns table in kusto

I have one parquet file ,trying to get the data in to the table. In one column it have json with multiple values. Can someone help me how to do in kusto?
Pasting the json's schema:
{
"$schema": "http://json-schema.org/draft-04/schema#",
"type": "object",
"properties": {
"path": {
"type": "string"
},
"partitionValues": {
"type": "object",
"properties": {
"deviceId": {
"type": "string"
},
"date": {
"type": "string"
}
},
"required": [
"deviceId",
"date"
]
},
"size": {
"type": "integer"
},
"modificationTime": {
"type": "integer"
},
"dataChange": {
"type": "boolean"
},
"stats": {
"type": "string"
}
},
"required": [
"path",
"partitionValues",
"size",
"modificationTime",
"dataChange",
"stats"
]
}
If I understood correctly, your PQ file contains a column with a JSON of the specified schema. If you want to ingest it as-is, ingest it into Kusto column with type "dynamic" and query later. If you'd like to ingest just part of this JSON data (like some inner fields), use ingestion mapping and provide appropriate JSON path.
Another option is to ingest as-is into a source table with a Retention policy with SoftDeletePeriod of zero. Define an Update Policy with a KQL query and the transformed data will be pushed into a target table.

Should sparse fields prevent compound documents from showing?

Given the following:
GET http://www.example.com/post/1?include=author
{
"type": "post",
"id": "1",
"attributes": {
"title": "It's a title.",
"description": "It's a description."
},
"relationships": {
"author": {
"links": {
"self": "http://example.com/articles/1/relationships/author",
"related": "http://example.com/people?filter[article]=1"
},
"data":{
"type":"people", "id":"9"
}
}
},
"included":[
{
"type":"people",
"id": "9",
"attributes": {
"first-name": "Dan",
"last-name": "Gebhardt",
"twitter": "dgeb"
},
"links": {
"self": "http://example.com/people/9"
}
}
],
"links": {
"self": "http://example.com/articles/1"
}
}
If I were to append fields[post]=title, (i.e. GET http://www.example.com/post/1?include=author&fields[post]=title) should this prevent included (compound document) from displaying?
GET http://www.example.com/post/1?include=author&fields[post]=title
{
"type": "post",
"id": "1",
"attributes": {
"title": "It's a title.",
},
"relationships": {
"author": {
"links": {
"self": "http://example.com/articles/1/relationships/author",
"related": "http://example.com/people?filter[article]=1"
},
"data":{
"type":"people", "id":"9"
}
}
}
}
Or are compound documents still supposed to render?
Sparse Fieldsets and Compound Documents could be used together. The spec explicitly lists sparse fieldsets as the only allowed case in which a included resource may not be linked by another other resource in the same document:
** Compound Documents**
Compound documents require “full linkage”, meaning that every included resource MUST be identified by at least one resource identifier object in the same document. These resource identifier objects could either be primary data or represent resource linkage contained within primary or included resources.
The only exception to the full linkage requirement is when relationship fields that would otherwise contain linkage data are excluded via sparse fieldsets.
To directly answer your question: The spare fieldset should only affect the linkage between the resources (e.g. by not including the relationship field between them) but should not cause a resource to be not included that otherwise would.

How to associate nested relationships with attributes for a POST in JSON API

According to the spec, resource identifier objects do not hold attributes.
I want to do a POST to create a new resource which includes other nested resource.
These are the basic resources: club (with name) and many positions (type). Think a football club with positions like goalkeeper, goalkeeper, striker, striker, etc.
When I do this association, I want to set some attributes like is the position required for this particular team. For example I only need 1 goalkeeper but I want to have a team with many reserve goalkeepers. When I model these entities in the DB I'll set the required attribute in a linkage table.
This is not compliant with JSON API:
{
"data": {
"type": "club",
"attributes": {
"name": "Backyard Football Club"
},
"relationships": {
"positions": {
"data": [{
"id": "1",
"type": "position",
"attributes": {
"required": "true"
}
}, {
"id": "1",
"type": "position",
"attributes": {
"required": "false"
}
}
]
}
}
}
}
This is also not valid:
{
"data": {
"type": "club",
"attributes": {
"name": "Backyard Football Club",
"positions": [{
"position_id": "1",
"required": "true"
},
{
"position_id": "1",
"required": "false"
}]
}
}
}
So how is the best way to approach this association?
The best approach here will be to create a separate resource for club_position
Creating a club will return a url to a create club_positions, you will then post club_positions to that url with a relationship identifier to the position and club resource.
Added benefit to this is that club_positions creation can be parallelized.

Document Db query filter for an attribute that contains an array

With the sample json shown below, am trying to retrieve all documents that contains atleast one category which is array object wrapped underneath Categories that has the text value 'drinks' with the following query but the returned result is empty. Can someone help me get this right?
SELECT items.id
,items.description
,items.Categories
FROM items
WHERE ARRAY_CONTAINS(items.Categories.Category.Text, "drink")
{
"id": "1dbaf1d0-6549-11a0-88a8-001256957023",
"Categories": {
"Category": [{
"Type": "GS1",
"Id": "10000266",
"Text": "Stimulants/Energy Drinks Ready to Drink"
}, {
"Type": "GS2",
"Id": "10000266",
"Text": "Healthy Drink"
}]
}
},
Note: The json is a bit wierd to have the array wrapped by an object itself - this json was converted from a XML hence the result. So please assume I do not have any control over how this object is saved as json
You need to flatten the document in your query to get the result you want by joining the array back to the main document. The query you want would look like this:
SELECT items.id, items.Categories
FROM items
JOIN Category IN items.Categories.Category
WHERE CONTAINS(LOWER(Category.Text), "drink")
However, because there is no concept of a DISTINCT query, this will produce duplicates equal to the number of Category items that contain the word "drink". So this query would produce your example document twice like this:
[
{
"id": "1dbaf1d0-6549-11a0-88a8-001256957023",
"Categories": {
"Category": [
{
"Type": "GS1",
"Id": "10000266",
"Text": "Stimulants/Energy Drinks Ready to Drink"
},
{
"Type": "GS2",
"Id": "10000266",
"Text": "Healthy Drink"
}
]
}
},
{
"id": "1dbaf1d0-6549-11a0-88a8-001256957023",
"Categories": {
"Category": [
{
"Type": "GS1",
"Id": "10000266",
"Text": "Stimulants/Energy Drinks Ready to Drink"
},
{
"Type": "GS2",
"Id": "10000266",
"Text": "Healthy Drink"
}
]
}
}
]
This could be problematic and expensive if the Categories array holds a lot of Category items that have "drink" in them.
You can cut that down if you are only interested in a single Category by changing the query to:
SELECT items.id, Category
FROM items
JOIN Category IN items.Categories.Category
WHERE CONTAINS(LOWER(Category.Text), "drink")
Which would produce a more concise result with only the id field repeated with each matching Category item showing up once:
[{
"id": "1dbaf1d0-6549-11a0-88a8-001256957023",
"Category": {
"Type": "GS1",
"Id": "10000266",
"Text": "Stimulants/Energy Drinks Ready to Drink"
}
},
{
"id": "1dbaf1d0-6549-11a0-88a8-001256957023",
"Category": {
"Type": "GS2",
"Id": "10000266",
"Text": "Healthy Drink"
}
}]
Otherwise, you will have to filter the results when you get them back from the query to remove duplicate documents.
If it were me and I was building a production system with this requirement, I'd use Azure Search. Here is some info on hooking it up to DocumentDB.
If you don't want to do that and we must live with the constraint that you can't change the shape of the documents, the only way I can think to do this is to use a User Defined Function (UDF) like this:
function GetItemsWithMatchingCategories(categories, matchingString) {
if (Array.isArray(categories) && categories !== null) {
var lowerMatchingString = matchingString.toLowerCase();
for (var index = 0; index < categories.length; index++) {
var category = categories[index];
var categoryName = category.Text.toLowerCase();
if (categoryName.indexOf(lowerMatchingString) >= 0) {
return true;
}
}
}
}
Note, the code above was modified by the asker after actually trying it out so it's somewhat tested.
You would use it with a query like this:
SELECT * FROM items WHERE udf.GetItemsWithMatchingCategories(items.Categories, "drink")
Also, note that this will result in a full table scan (unless you can combine it with other criteria that can use an index) which may or may not meet your performance/RU limit constraints.

How to create compound documents?

I'm thinking of using the JSONAPI standard for the design of our API. One thing this API must be able to do, is accept a compound document (several layers deep) and create it. The root object owns all descendants ('to-many' relationships) which the server knows nothing about at that point, so it's not possible for the client to provide an id.
Is this supported by the specification or does the client have to issue http requests for every object in the document in order?
from http://jsonapi.org/format/#document-compound-documents
Compound documents require "full linkage", meaning that every included
resource MUST be identified by at least one resource identifier object
in the same document. These resource identifier objects could either
be primary data or represent resource linkage contained within primary
or included resources. The only exception to the full linkage
requirement is when relationship fields that would otherwise contain
linkage data are excluded via sparse fieldsets.
{
"data": [{
"type": "articles",
"id": "1",
"attributes": {
"title": "JSON API paints my bikeshed!"
},
"links": {
"self": "http://example.com/articles/1"
},
"relationships": {
"author": {
"links": {
"self": "http://example.com/articles/1/relationships/author",
"related": "http://example.com/articles/1/author"
},
"data": { "type": "people", "id": "9" }
},
"comments": {
"links": {
"self": "http://example.com/articles/1/relationships/comments",
"related": "http://example.com/articles/1/comments"
},
"data": [
{ "type": "comments", "id": "5" },
{ "type": "comments", "id": "12" }
]
}
}
}],
"included": [{
"type": "people",
"id": "9",
"attributes": {
"first-name": "Dan",
"last-name": "Gebhardt",
"twitter": "dgeb"
},
"links": {
"self": "http://example.com/people/9"
}
}, {
"type": "comments",
"id": "5",
"attributes": {
"body": "First!"
},
"links": {
"self": "http://example.com/comments/5"
}
}, {
"type": "comments",
"id": "12",
"attributes": {
"body": "I like XML better"
},
"links": {
"self": "http://example.com/comments/12"
}
}]
}

Resources