jq: Nested JSON array transformation

A few months ago I had a little problem with a jq transformation (jq 1.5 on Windows 10). Since then, this command has worked fine:
[{nid, title, nights, company: .operator.shortTitle, zone: .zones[0].title}
+ (.sails[] | { sails_nid: .nid, arrival, departure } )
+ (.sails[].cabins[] | { cabinname: .cabinType.title, cabintype: .cabinType.kindName, cabinnid: .cabinType.nid, catalogPrice, discountPrice, discountPercentage, currency } )]
A few days ago the API started delivering bigger JSON files (example file attached). With the jq command above I now get a lot of duplicates: with the attached file I get around 3146 objects, while around 250 are expected. I tried to change the jq command to avoid the duplicates, but had no luck.
The JSON files contain a variable number of sails (10 in this case), and each sail has a variable number of cabins (25 in this case). Any tips on how I can achieve that? Regards, Timo

This is probably what you're looking for:
[ {nid, title, nights, company: .operator.shortTitle, zone: .zones[0].title}
  + (.sails[] | ({ sails_nid: .nid, arrival, departure }
      + (.cabins[] | { cabinname: .cabinType.title,
                       cabintype: .cabinType.kindName,
                       cabinnid: .cabinType.nid,
                       catalogPrice,
                       discountPrice,
                       discountPercentage,
                       currency } ))) ]
Hopefully the layout clarifies the difference from your filter: here the iteration over .cabins[] is nested inside the iteration over .sails[], whereas your version iterates .sails[] and .sails[].cabins[] independently, so jq's + combines every result of the first stream with every result of the second, producing a cross product (hence the duplicates). With 10 sails of 25 cabins each, the nested version yields exactly 10 × 25 = 250 objects.
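Since quoting a multi-line filter on the Windows command line is awkward, you could also save the filter to a file and run it like this (filter.jq and input.json are placeholder names chosen here for illustration):
jq -f filter.jq input.json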

Related

JSON to CSV conversion using jq

I have a 1 GB JSON file that I'd like to convert to CSV format. The file contains information about UK company persons with significant control (PSC).
File source: http://download.companieshouse.gov.uk/en_pscdata.html
Here is a snippet of the PSC data product:
{"company_number":"04502074","data":{"address":{"address_line_1":"Grove Hall","address_line_2":"Ashbourne Green","locality":"Ashbourne","postal_code":"DE6 1JD","region":"Derbyshire"},"country_of_residence":"England","date_of_birth":{"month":11,"year":1964},"etag":"f9a632332f63b61f004569f99d6b15e3e6d28192","kind":"individual-person-with-significant-control","links":{"self":"/company/04502074/persons-with-significant-control/individual/34zTsx2BFGMyn0lJe2REL656U8w"},"name":"Mr Philip Anthony Donlan","name_elements":{"forename":"Philip","middle_name":"Anthony","surname":"Donlan","title":"Mr"},"nationality":"British","natures_of_control":["ownership-of-shares-50-to-75-percent"],"notified_on":"2016-04-06"}}
{"company_number":"10260075","data":{"address":{"country":"England","locality":"Widnes","postal_code":"WA8 9DH","premises":"1 Stockswell Farm Court"},"country_of_residence":"England","date_of_birth":{"month":12,"year":1978},"etag":"dbf13fc08cb9136450089681b6e9364eb8458129","kind":"individual-person-with-significant-control","links":{"self":"/company/10260075/persons-with-significant-control/individual/Br24rkYIl3ZKam3C9fT4o_9uF7k"},"name":"Mr Daniel Thomas Ross","name_elements":{"forename":"Daniel","middle_name":"Thomas","surname":"Ross","title":"Mr"},"nationality":"English","natures_of_control":["significant-influence-or-control"],"notified_on":"2016-07-01"}}
{"company_number":"SC539354","data":{"address":{"address_line_1":"5 West Victoria Dock Road","locality":"Dundee","postal_code":"DD1 3JT","premises":"Begbies Traynor (Central) Llp, River Court"},"country_of_residence":"Scotland","date_of_birth":{"month":4,"year":1980},"etag":"37599a22ede050072457db60af6e75ba8e237246","kind":"individual-person-with-significant-control","links":{"self":"/company/SC539354/persons-with-significant-control/individual/T7LPjXkKRuaunfMRjfFrnWiHEnI"},"name":"Mr Stuart Hemple","name_elements":{"forename":"Stuart","surname":"Hemple","title":"Mr"},"nationality":"British","natures_of_control":["ownership-of-shares-75-to-100-percent","voting-rights-75-to-100-percent","right-to-appoint-and-remove-directors","significant-influence-or-control"],"notified_on":"2016-07-01"}}
{"company_number":"02722495","data":{"address":{"address_line_1":"Beechdene","address_line_2":"108 Coventry Road","locality":"Warwick","postal_code":"CV34 5HH"},"country_of_residence":"England","date_of_birth":{"month":12,"year":1953},"etag":"610138d3809ab3237b609f3cb93bfe4bf89d7581","kind":"individual-person-with-significant-control","links":{"self":"/company/02722495/persons-with-significant-control/individual/8g4ED3usT4wLPEqra7dE97eqmHE"},"name":"Mr Marshall Fenn Stephenson","name_elements":{"forename":"Marshall","middle_name":"Fenn","surname":"Stephenson","title":"Mr"},"nationality":"British","natures_of_control":["ownership-of-shares-25-to-50-percent"],"notified_on":"2016-07-01"}}
{"company_number":"05495733","data":{"address":{"address_line_1":"Brompton Road","country":"England","locality":"London","postal_code":"SW3 2AS","premises":"253"},"country_of_residence":"Italy","date_of_birth":{"month":4,"year":1953},"etag":"d45e5d2aa905e769e6fd3aa364e301f73b047985","kind":"individual-person-with-significant-control","links":{"self":"/company/05495733/persons-with-significant-control/individual/Oqp-z-D5JTX0mjXTtmOqmct1vR4"},"name":"Mr Roberto Gavazzi","name_elements":{"forename":"Roberto","surname":"Gavazzi","title":"Mr"},"nationality":"Italian","natures_of_control":["ownership-of-shares-50-to-75-percent","voting-rights-50-to-75-percent","right-to-appoint-and-remove-directors-as-firm","significant-influence-or-control-as-firm"],"notified_on":"2016-06-30"}}
{"company_number":"SC539355","data":{"address":{"address_line_1":"6 Dryden Road","country":"Scotland","locality":"Loanhead","postal_code":"EH20 9LZ","premises":"Bilston Glen Business Centre"},"country_of_residence":"Scotland","date_of_birth":{"month":10,"year":1990},"etag":"b03abb8bb1b6f95039dd896210d7c231d8784c31","kind":"individual-person-with-significant-control","links":{"self":"/company/SC539355/persons-with-significant-control/individual/tYyjuJrp6Ifp327VxThVGeRswMM"},"name":"Mr David John Kelly","name_elements":{"forename":"David","middle_name":"John","surname":"Kelly","title":"Mr"},"nationality":"Scottish","natures_of_control":["ownership-of-shares-75-to-100-percent"],"notified_on":"2016-07-01"}}
{"company_number":"SC539356","data":{"address":{"address_line_1":"Scholes","country":"England","locality":"Wigan","postal_code":"WN1 1YF","premises":"106 Douglas House"},"country_of_residence":"England","date_of_birth":{"month":3,"year":1961},"etag":"2f15d0fbacc68763b00e203ab0820b0911ac5906","kind":"individual-person-with-significant-control","links":{"self":"/company/SC539356/persons-with-significant-control/individual/0eATs-Ecoj9ie0_pBCq29L6UtlM"},"name":"Mr Mark Edward Sowery","name_elements":{"forename":"Mark","middle_name":"Edward","surname":"Sowery","title":"Mr"},"nationality":"British","natures_of_control":["ownership-of-shares-75-to-100-percent"],"notified_on":"2016-07-01"}}
{"company_number":"07674942","data":{"address":{"address_line_1":"Old Gloucester Street","country":"England","locality":"London","postal_code":"WC1N 3AX","premises":"27"},"country_of_residence":"Sierra Leone","date_of_birth":{"month":2,"year":1979},"etag":"f2e9cc0033cd5ef24fcda06baeff06bb8ea72654","kind":"individual-person-with-significant-control","links":{"self":"/company/07674942/persons-with-significant-control/individual/D31C5Na0B1I4rqM1RYwpy3J8oKA"},"name":"Mr Muhammad Umar Babar","name_elements":{"forename":"Muhammad","middle_name":"Umar","surname":"Babar","title":"Mr"},"nationality":"Pakistani","natures_of_control":["ownership-of-shares-75-to-100-percent"],"notified_on":"2016-07-01"}}
{"company_number":"09639364","data":{"address":{"address_line_1":"Galmington Road","country":"United Kingdom","locality":"Taunton","postal_code":"TA1 5NP","premises":"58b","region":"Somerset"},"country_of_residence":"United Kingdom","date_of_birth":{"month":12,"year":1977},"etag":"25ff7d41f9b8f257f0d41ae82e88202017beff34","kind":"individual-person-with-significant-control","links":{"self":"/company/09639364/persons-with-significant-control/individual/qlPpucOQopiIgq1xzZIb6xjO5JQ"},"name":"Mr Li Ying Cao","name_elements":{"forename":"Li","middle_name":"Ying","surname":"Cao","title":"Mr"},"nationality":"British","natures_of_control":["ownership-of-shares-75-to-100-percent"],"notified_on":"2016-07-01"}}
{"company_number":"08541397","data":{"address":{"address_line_1":"Hedley Avenue","locality":"Blyth","postal_code":"NE24 3JP","premises":"27","region":"Northumberland"},"country_of_residence":"England","date_of_birth":{"month":11,"year":1949},"etag":"b843664ca67a4274ee6f6cc9816ab35cd8367190","kind":"individual-person-with-significant-control","links":{"self":"/company/08541397/persons-with-significant-control/individual/KuddC6fZH17ifaXSAVWEcC2ba74"},"name":"Mr David Harwood","name_elements":{"forename":"David","surname":"Harwood","title":"Mr"},"nationality":"British","natures_of_control":["ownership-of-shares-75-to-100-percent"],"notified_on":"2016-05-23"}}
{"company_number":"02832188","data":{"address":{"address_line_1":"Lodge Road","country":"England","locality":"London","postal_code":"NW4 4DD","premises":"1"},"country_of_residence":"England","date_of_birth":{"month":5,"year":1952},"etag":"0c21b2b560ee43ca0c2ffd9f07d5ca564536b6e2","kind":"individual-person-with-significant-control","links":{"self":"/company/02832188/persons-with-significant-control/individual/Rh8pb-L7fEZzkyhttuCwVjjL_eA"},"name":"Mrs Rachel Weissman","name_elements":{"forename":"Rachel","surname":"Weissman","title":"Mrs"},"nationality":"British","natures_of_control":["ownership-of-shares-25-to-50-percent","voting-rights-25-to-50-percent","right-to-appoint-and-remove-directors","significant-influence-or-control"],"notified_on":"2016-07-01"}}
I have created an input file, in.json, containing my JSON data as provided by Companies House (file source above).
I have created an output file, out.csv.
I am trying to run the following code:
jq -r 'map({company_number,address_line_1,country,locality,postal_code,premises,ceased_on,country_of_residence,month,year,etag,kind}) | (first | keys_unsorted) as $keys | map([to_entries[] | .value]) as $rows | $keys,$rows[] | @csv' in.json > out.csv
I'm getting the following error:
jq: error (at in.json:0): Cannot index string with string "company_number"
Please advise on what I am doing wrong and how to get this done.
Since you are selecting bits of data from different levels of the input objects, you will need to specify the selection more precisely.
As your input consists of a stream of JSON objects, let's start with a function for reading one of those objects:
# Input and output: a JSON object
def get:
  {company_number} as $number
  | .data
  | (.address | {address_line_1,country,locality,postal_code,premises}) as $address
  | {ceased_on,country_of_residence} as $details
  | (.date_of_birth | {month, year}) as $dob
  | $number + $address + $details + $dob + {etag,kind}
;
There are several ways to read JSON streams, but it's quite convenient to use input and inputs with the -n command-line option.
To make things easy to read, let's next define another helper function for producing an array of the relevant data:
def getRow:
  get | [.[]];
Putting it all together:
(input | get)
| keys_unsorted,
  [.[]],
  (inputs | getRow)
| @csv
Don't forget the -r and -n command-line options!
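For example, with the two defs followed by the main program saved in a file, say psc2csv.jq (a name chosen here for illustration), the invocation would be:
jq -nr -f psc2csv.jq in.json > out.csv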
Footnote:
In general, using [.[]] to "flatten" a JSON object to a flat array of values is ill-advised, but in the present case, we have ensured a consistent ordering of keys in get, and it is reasonable to assume that none of the values in the selected fields are compound, as suggested by the snippet and the 500,000 records in one of the snapshot files. A "robustification" would, however, be trivial to achieve (e.g. using tostring), and might therefore be advisable.
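For instance, the getRow helper above could be robustified like so (a sketch of the tostring approach just mentioned; note it turns nulls into the string "null"):
def getRow:
  get | [.[] | tostring];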
If you were using gojq (the Go implementation of jq), you would have to do things slightly differently as gojq does not respect user-specified ordering of keys. You'd have to generate the header row differently and make minor changes to get.

How to check if the device's time is between two times in Flutter from Firebase/Firestore?

In my Firestore project, I have documents in a collection containing data for shops, with fields like shopName, shopAddress, startTime (e.g. 10 AM) and closeTime (e.g. 10 PM) (all strings for now).
When the user is browsing the app, I retrieve the shops' data from Firestore and display it; now I want to show that a shop is closed when the device's time is not between the shop's startTime and closeTime. How do I achieve this?
So far I can get the device's current time using the Dart package intl with this code:
print("${DateFormat('j').format(DateTime.now())}");
It gives output as follows:
I/flutter (14877): 6 PM
This is a DateFormat string, and the values stored in Firestore are strings too; I don't know how to compare them. Do let me know if I have to change the data types in Firestore as well.
Thank You
I think if you use the 24-hour time format and convert startTime, closeTime and actualTime to int or double (double if the shop closes at, say, 20:30 / 8:30 PM), then you can easily compare them with an if. On your Firebase server the string format is fine.
For example, you can build a map from the 12-hour labels to 24-hour values, look all three times up in it, and check whether actualTime is at or after startTime and before closeTime.
I have never tried this code, but I think it is going to work.
// Map 12-hour labels (as produced by DateFormat('j')) to 24-hour values.
final hourMap = {'12 AM': 0, '1 AM': 1, '2 AM': 2, /* ... */ '11 PM': 23};

final start = hourMap[startTime]; // e.g. '10 AM' -> 10
final close = hourMap[closeTime]; // e.g. '10 PM' -> 22
final now = hourMap[actualTime];

if (start != null && close != null && now != null &&
    now >= start && now < close) {
  print('Open');
} else {
  print('Closed');
}
By the way, I think you should use UTC, because if you change the time zone on your device, your app is going to show that the shop is closed when in fact it is open; you are just in a different time zone. You can get the UTC time with this code:
var now = DateTime.now().toUtc();
Alternatively, you can create a list of the labels in chronological order, like this:
hours = ['12 AM', '1 AM', '2 AM', ... , '11 PM'];
After that you can get the positions of startTime, closeTime and actualTime in the list, and check whether actualTime's position lies between the start and close positions.
Let me know if you want me to give you a fuller code example.
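In the meantime, here's a minimal sketch of that position-based approach (untested, and assuming the labels exactly match what DateFormat('j') prints):
// 12-hour labels in chronological order, matching DateFormat('j') output.
const hours = [
  '12 AM', '1 AM', '2 AM', '3 AM', '4 AM', '5 AM', '6 AM', '7 AM',
  '8 AM', '9 AM', '10 AM', '11 AM', '12 PM', '1 PM', '2 PM', '3 PM',
  '4 PM', '5 PM', '6 PM', '7 PM', '8 PM', '9 PM', '10 PM', '11 PM',
];

bool isOpen(String startTime, String closeTime, String actualTime) {
  final start = hours.indexOf(startTime); // e.g. '10 AM' -> position 10
  final close = hours.indexOf(closeTime); // e.g. '10 PM' -> position 22
  final now = hours.indexOf(actualTime);
  // indexOf returns -1 for unknown labels; treat those as closed.
  if (start == -1 || close == -1 || now == -1) return false;
  return now >= start && now < close;
}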

daily_schedule-triggered runs and backfill runs have different date partitions

I have a @daily_schedule triggered daily at 3 minutes past 12 AM.
When triggered by the scheduled tick at '2021-02-16 00:03:00'
The date input shows '2021-02-15 00:00:00', partition tagged as '2021-02-15'
While if triggered via backfill for partition '2021-02-16'
The date input shows '2021-02-16 00:00:00', partition tagged as '2021-02-16'
Why does the scheduled tick fill the partition for the day before? Is there an option to use the datetime of execution instead (without using a cron @schedule)? This discrepancy is confusing when I perform queries using the timestamp for exact dates.
P.S. I have verified that the scheduled run and the backfill run use the same timezone.
from datetime import time
from dagster import daily_schedule, pipeline, solid

@solid()
def test_solid(_, date):
    _.log.info(f"Input date: {date}")

@pipeline()
def test_pipeline():
    test_solid()

@daily_schedule(
    pipeline_name="test_pipeline",
    execution_timezone="Asia/Singapore",
    start_date=START_DATE,
    end_date=END_DATE,
    execution_time=time(0, 3),
    # should_execute=four_hourly_filter
)
def test_schedule_daily(date):
    timestamp = date.strftime("%Y-%m-%d %X")
    return {
        "solids": {
            "test_solid": {
                "inputs": {
                    "date": {
                        "value": timestamp
                    }
                }
            }
        }
    }
Sorry for the trouble here - the underlying assumption that the system is making here is that for schedules on pipelines that are partitioned by date, you don't fill in the partition for a day until that day has finished (i.e. the job filling in the data for 2/15 wouldn't run until the next day on 2/16). This is a common pattern in scheduled ETL jobs, but you're completely right that it's not a given that all schedules will want this behavior, and this is good feedback that we should make this use case easier.
It is possible to make a schedule for a partition in the way that you want, but it's more cumbersome. It would look something like this:
from dagster import PartitionSetDefinition, date_partition_range, create_offset_partition_selector

def partition_run_config(date):
    timestamp = date.strftime("%Y-%m-%d %X")
    return {
        "solids": {
            "test_solid": {
                "inputs": {
                    "date": {
                        "value": timestamp
                    }
                }
            }
        }
    }

test_partition_set = PartitionSetDefinition(
    name="test_partition_set",
    pipeline_name="test_pipeline",
    partition_fn=date_partition_range(start=START_DATE, end=END_DATE, inclusive=True, timezone="Asia/Singapore"),
    run_config_fn_for_partition=partition_run_config,
)

test_schedule_daily = test_partition_set.create_schedule_definition(
    "test_schedule_daily",
    "3 0 * * *",
    execution_timezone="Asia/Singapore",
    partition_selector=create_offset_partition_selector(lambda d: d.subtract(minutes=3)),
)
This is pretty similar to @daily_schedule's implementation, it just uses a different function for mapping the schedule execution time to a partition (subtracting 3 minutes instead of 3 minutes and 1 day - that's the create_offset_partition_selector part).
I'll file an issue for an option to customize the mapping for the partitioned schedule decorators, but something like that may unblock you in the meantime. Thanks for the feedback!
Just an update on this: We added a 'partition_days_offset' parameter to the 'daily_schedule' decorator (and a similar parameter to the other schedule decorators) that lets you customize this behavior. The default is still to go back 1 day, but setting partition_days_offset=0 will give you the behavior you were hoping for where the execution day is the same as the partition day. This should be live in our next weekly release on 2/18.
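With that parameter, the schedule from the question could be written roughly like this (an untested sketch; START_DATE and END_DATE as in the original snippet):
from datetime import time
from dagster import daily_schedule

@daily_schedule(
    pipeline_name="test_pipeline",
    execution_timezone="Asia/Singapore",
    start_date=START_DATE,
    end_date=END_DATE,
    execution_time=time(0, 3),
    partition_days_offset=0,  # partition day == execution day
)
def test_schedule_daily(date):
    timestamp = date.strftime("%Y-%m-%d %X")
    return {
        "solids": {
            "test_solid": {"inputs": {"date": {"value": timestamp}}}
        }
    }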

Reliable method to check actual size occupied by data in hot cache

I have a table with a 1-day hot cache policy on it; assume also that cache utilization of the ADX cluster is less than 80%. Given that, what would be a reliable method to know exactly how much cache space (TB) the table actually occupies? I came up with the following two methods, but they return significantly different numbers:
.show table <TableName> extents hot | summarize sum(ExtentSize)/pow(1024,4)
.show table <TableName> extents | where MaxCreatedOn >= ago(1d) | summarize extent_size=sum(ExtentSize) | project size_in_TB=((extent_size)/pow(1024,4))
The second command returns a value more than 10 times higher than the first one. How can they be that different?
Both commands you ran should return the same value, assuming:
you ran them at the same time (or quickly one after the other)
the effective caching policy is indeed 1 day (have you verified that this is the case? see the command sketch below)
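(For reference, one way to check: the command below shows the caching policy set at the table level, if any; if the table inherits the policy, check the database-level policy with .show database DATABASENAME policy caching instead.)
.show table TABLENAME policy caching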
Regardless - the most efficient way to get that data point is by using the following command:
.show table TABLENAME details
| project HotExtentSizeTb = HotExtentSize/exp2(40), CachingPolicy
Here's an example from a table of mine, which has a caching policy of 4 days (set at table level), and a retention policy with a soft delete period of 3650 days:
// option 1
// --------
.show table yonis_table extents hot
| summarize HotExtentSizeTb = sum(ExtentSize)/exp2(40)
// returns: HotExtentSizeTb: 0.723723856871402 <---
// option 2: least efficient
// -------------------------
.show table yonis_table extents
| where MaxCreatedOn >= ago(4d)
| summarize HotExtentSizeTb = sum(ExtentSize)/exp2(40)
// returns: HotExtentSizeTb: 0.723723856871402 <---
// option 3: most efficient
// ------------------------
.show table yonis_table details
| project HotExtentSizeTb = HotExtentSize/exp2(40), CachingPolicy, RetentionPolicy
// returns:
HotExtentSizeTb: 0.723723856871402, <---
CachingPolicy: {
"DataHotSpan": "4.00:00:00"
},
RetentionPolicy: {
"SoftDeletePeriod": "3650.00:00:00",
"Recoverability": "Enabled"
}

ElasticSearch. How to get counts for several ranges in one query?

Currently I am getting count for a range of values via this query:
$ curl -XGET 'http://localhost:9200/myindex/mytype/_count' -d '{
range:{myfield:{gt:"start_val",lt:"end_val"}}
}
'
Now I have several ranges, and need counts for each range. Can I get them with one query, rather than re-querying each time?
I looked into multi-search with search_type=count, but that's probably not the right approach (it gave me just a single aggregated count rather than one count per range; it looks like I misused it).
EDIT: I've found that the range facet would have been great, but unfortunately my values are neither numbers nor dates; they're just strings...
EDIT2: This is what I ended up with, based on the accepted answer:
$ curl -XGET 'http://localhost:9200/myindex/mytype/_search?search_type=count' -d '{
facets : {
range1 : {
filter : {
range:{myfield:{gt:"start_val1",lt:"end_val1"}}
}
},
range2 : {
filter : {
range:{myfield:{gt:"start_val2",lt:"end_val2"}}
}
}
}
}
'
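With that, the response carries one count per named facet, roughly of this shape (the counts here are illustrative):
"facets" : {
  "range1" : { "_type" : "filter", "count" : 42 },
  "range2" : { "_type" : "filter", "count" : 17 }
}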
Here's a link where the creator of ES gives a solution (i.e., several filter facets): solution
So, one filter facet per range should work alright.
And here's the link to the docs: doc api
Hope it helps.
