I have a 1GB JSON file I like to convert to CSV format. The file contains information about UK company people with significant control (PSC).
file source: http://download.companieshouse.gov.uk/en_pscdata.html
here is a data snippet of PSC Data product:
{"company_number":"04502074","data":{"address":{"address_line_1":"Grove Hall","address_line_2":"Ashbourne Green","locality":"Ashbourne","postal_code":"DE6 1JD","region":"Derbyshire"},"country_of_residence":"England","date_of_birth":{"month":11,"year":1964},"etag":"f9a632332f63b61f004569f99d6b15e3e6d28192","kind":"individual-person-with-significant-control","links":{"self":"/company/04502074/persons-with-significant-control/individual/34zTsx2BFGMyn0lJe2REL656U8w"},"name":"Mr Philip Anthony Donlan","name_elements":{"forename":"Philip","middle_name":"Anthony","surname":"Donlan","title":"Mr"},"nationality":"British","natures_of_control":["ownership-of-shares-50-to-75-percent"],"notified_on":"2016-04-06"}}
{"company_number":"10260075","data":{"address":{"country":"England","locality":"Widnes","postal_code":"WA8 9DH","premises":"1 Stockswell Farm Court"},"country_of_residence":"England","date_of_birth":{"month":12,"year":1978},"etag":"dbf13fc08cb9136450089681b6e9364eb8458129","kind":"individual-person-with-significant-control","links":{"self":"/company/10260075/persons-with-significant-control/individual/Br24rkYIl3ZKam3C9fT4o_9uF7k"},"name":"Mr Daniel Thomas Ross","name_elements":{"forename":"Daniel","middle_name":"Thomas","surname":"Ross","title":"Mr"},"nationality":"English","natures_of_control":["significant-influence-or-control"],"notified_on":"2016-07-01"}}
{"company_number":"SC539354","data":{"address":{"address_line_1":"5 West Victoria Dock Road","locality":"Dundee","postal_code":"DD1 3JT","premises":"Begbies Traynor (Central) Llp, River Court"},"country_of_residence":"Scotland","date_of_birth":{"month":4,"year":1980},"etag":"37599a22ede050072457db60af6e75ba8e237246","kind":"individual-person-with-significant-control","links":{"self":"/company/SC539354/persons-with-significant-control/individual/T7LPjXkKRuaunfMRjfFrnWiHEnI"},"name":"Mr Stuart Hemple","name_elements":{"forename":"Stuart","surname":"Hemple","title":"Mr"},"nationality":"British","natures_of_control":["ownership-of-shares-75-to-100-percent","voting-rights-75-to-100-percent","right-to-appoint-and-remove-directors","significant-influence-or-control"],"notified_on":"2016-07-01"}}
{"company_number":"02722495","data":{"address":{"address_line_1":"Beechdene","address_line_2":"108 Coventry Road","locality":"Warwick","postal_code":"CV34 5HH"},"country_of_residence":"England","date_of_birth":{"month":12,"year":1953},"etag":"610138d3809ab3237b609f3cb93bfe4bf89d7581","kind":"individual-person-with-significant-control","links":{"self":"/company/02722495/persons-with-significant-control/individual/8g4ED3usT4wLPEqra7dE97eqmHE"},"name":"Mr Marshall Fenn Stephenson","name_elements":{"forename":"Marshall","middle_name":"Fenn","surname":"Stephenson","title":"Mr"},"nationality":"British","natures_of_control":["ownership-of-shares-25-to-50-percent"],"notified_on":"2016-07-01"}}
{"company_number":"05495733","data":{"address":{"address_line_1":"Brompton Road","country":"England","locality":"London","postal_code":"SW3 2AS","premises":"253"},"country_of_residence":"Italy","date_of_birth":{"month":4,"year":1953},"etag":"d45e5d2aa905e769e6fd3aa364e301f73b047985","kind":"individual-person-with-significant-control","links":{"self":"/company/05495733/persons-with-significant-control/individual/Oqp-z-D5JTX0mjXTtmOqmct1vR4"},"name":"Mr Roberto Gavazzi","name_elements":{"forename":"Roberto","surname":"Gavazzi","title":"Mr"},"nationality":"Italian","natures_of_control":["ownership-of-shares-50-to-75-percent","voting-rights-50-to-75-percent","right-to-appoint-and-remove-directors-as-firm","significant-influence-or-control-as-firm"],"notified_on":"2016-06-30"}}
{"company_number":"SC539355","data":{"address":{"address_line_1":"6 Dryden Road","country":"Scotland","locality":"Loanhead","postal_code":"EH20 9LZ","premises":"Bilston Glen Business Centre"},"country_of_residence":"Scotland","date_of_birth":{"month":10,"year":1990},"etag":"b03abb8bb1b6f95039dd896210d7c231d8784c31","kind":"individual-person-with-significant-control","links":{"self":"/company/SC539355/persons-with-significant-control/individual/tYyjuJrp6Ifp327VxThVGeRswMM"},"name":"Mr David John Kelly","name_elements":{"forename":"David","middle_name":"John","surname":"Kelly","title":"Mr"},"nationality":"Scottish","natures_of_control":["ownership-of-shares-75-to-100-percent"],"notified_on":"2016-07-01"}}
{"company_number":"SC539356","data":{"address":{"address_line_1":"Scholes","country":"England","locality":"Wigan","postal_code":"WN1 1YF","premises":"106 Douglas House"},"country_of_residence":"England","date_of_birth":{"month":3,"year":1961},"etag":"2f15d0fbacc68763b00e203ab0820b0911ac5906","kind":"individual-person-with-significant-control","links":{"self":"/company/SC539356/persons-with-significant-control/individual/0eATs-Ecoj9ie0_pBCq29L6UtlM"},"name":"Mr Mark Edward Sowery","name_elements":{"forename":"Mark","middle_name":"Edward","surname":"Sowery","title":"Mr"},"nationality":"British","natures_of_control":["ownership-of-shares-75-to-100-percent"],"notified_on":"2016-07-01"}}
{"company_number":"07674942","data":{"address":{"address_line_1":"Old Gloucester Street","country":"England","locality":"London","postal_code":"WC1N 3AX","premises":"27"},"country_of_residence":"Sierra Leone","date_of_birth":{"month":2,"year":1979},"etag":"f2e9cc0033cd5ef24fcda06baeff06bb8ea72654","kind":"individual-person-with-significant-control","links":{"self":"/company/07674942/persons-with-significant-control/individual/D31C5Na0B1I4rqM1RYwpy3J8oKA"},"name":"Mr Muhammad Umar Babar","name_elements":{"forename":"Muhammad","middle_name":"Umar","surname":"Babar","title":"Mr"},"nationality":"Pakistani","natures_of_control":["ownership-of-shares-75-to-100-percent"],"notified_on":"2016-07-01"}}
{"company_number":"09639364","data":{"address":{"address_line_1":"Galmington Road","country":"United Kingdom","locality":"Taunton","postal_code":"TA1 5NP","premises":"58b","region":"Somerset"},"country_of_residence":"United Kingdom","date_of_birth":{"month":12,"year":1977},"etag":"25ff7d41f9b8f257f0d41ae82e88202017beff34","kind":"individual-person-with-significant-control","links":{"self":"/company/09639364/persons-with-significant-control/individual/qlPpucOQopiIgq1xzZIb6xjO5JQ"},"name":"Mr Li Ying Cao","name_elements":{"forename":"Li","middle_name":"Ying","surname":"Cao","title":"Mr"},"nationality":"British","natures_of_control":["ownership-of-shares-75-to-100-percent"],"notified_on":"2016-07-01"}}
{"company_number":"08541397","data":{"address":{"address_line_1":"Hedley Avenue","locality":"Blyth","postal_code":"NE24 3JP","premises":"27","region":"Northumberland"},"country_of_residence":"England","date_of_birth":{"month":11,"year":1949},"etag":"b843664ca67a4274ee6f6cc9816ab35cd8367190","kind":"individual-person-with-significant-control","links":{"self":"/company/08541397/persons-with-significant-control/individual/KuddC6fZH17ifaXSAVWEcC2ba74"},"name":"Mr David Harwood","name_elements":{"forename":"David","surname":"Harwood","title":"Mr"},"nationality":"British","natures_of_control":["ownership-of-shares-75-to-100-percent"],"notified_on":"2016-05-23"}}
{"company_number":"02832188","data":{"address":{"address_line_1":"Lodge Road","country":"England","locality":"London","postal_code":"NW4 4DD","premises":"1"},"country_of_residence":"England","date_of_birth":{"month":5,"year":1952},"etag":"0c21b2b560ee43ca0c2ffd9f07d5ca564536b6e2","kind":"individual-person-with-significant-control","links":{"self":"/company/02832188/persons-with-significant-control/individual/Rh8pb-L7fEZzkyhttuCwVjjL_eA"},"name":"Mrs Rachel Weissman","name_elements":{"forename":"Rachel","surname":"Weissman","title":"Mrs"},"nationality":"British","natures_of_control":["ownership-of-shares-25-to-50-percent","voting-rights-25-to-50-percent","right-to-appoint-and-remove-directors","significant-influence-or-control"],"notified_on":"2016-07-01"}}
I have created a input file in.json, contain my json data as providet by companies house: file source: http://download.companieshouse.gov.uk/en_pscdata.html
I have created a output file out.csv
I am trying to run the follwing code:
jq -r 'map({company_number,address_line_1,country,locality,postal_code,premises,ceased_on,country_of_residence,month,year,etag,kind}) | (first | keys_unsorted) as $keys | map([to_entries[] | .value]) as $rows | $keys,$rows[] | #csv' in.json > out.csv
im getting the following error:
jq: error (at in.json:0): Cannot index string with string "company_number"
please advise on what am I doing wrong and how to get this done.
Since you are selecting bits of data from different levels of the input objects, you will need to specify the selection more precisely.
As your input consists of a stream of JSON objects, let's start with a function for reading one of those objects:
# Input and output: a JSON object
def get:
{company_number} as $number
| .data
| (.address | {address_line_1,country,locality,postal_code,premises}) as $address
| {ceased_on,country_of_residence} as $details
| (.date_of_birth | {month, year}) as $dob
| $number + $address + $details + $dob + {etag,kind}
;
There are several ways to read JSON streams, but it's quite convenient to use use input and inputs with the -n command-line option.
To make things easy to read, let's next define another helper function for producing an array of the relevant data:
def getRow:
get | [.[]];
Putting it all together:
(input|get)
| keys_unsorted,
[.[]],
(inputs | getRow)
| #csv
Don't forget the -r and -n command-line options!
Footnote:
In general, using [.[]] to "flatten" a JSON object to a flat array of values is ill-advised, but in the present case, we have ensured a consistent ordering of keys in get, and it is reasonable to assume that none of the values in the selected fields are compound, as suggested by the snippet and the 500,000 records in one of the snapshot files. A "robustification" would, however, be trivial to achieve (e.g. using tostring), and might therefore be advisable.
If you were using gojq (the Go implementation of jq), you would have to do things slightly differently as gojq does not respect user-specified ordering of keys. You'd have to generate the header row differently and make minor changes to get.
In the Firestore project, I have documents in a collection containing data for shops, having fields like shopName, shopAddress, startTime(eg. 10 AM) and closeTime(eg. 10 PM) . (all strings for now)
When the user is browsing the app, i have retrieved the data from Firestore of the shops displayed in the app, now i wanna show that the shop is closed when the device's time is not between the startTime and closeTime of the shop. How do i achieve this?
So far I can detect the device's current time using dart package intl using this code:
print("${DateFormat('j').format(DateTime.now())}");
It gives output as follows:
I/flutter (14877): 6 PM
This is in DateFormat, and the data types stored in Firestore are strings.. I dont know how to compare them.. Do let me know if i have to change the data types in Firestore too.
Thank You
I think if you use 24 Hour Time Format and convert startTime, closeTime and actualTime to int or double ( if the shop close at 20:30/8:30pm), then you can easily compare them with if. On your firebase server string format is perfect.
For example you make a map and iterate it, and check if the actualTime is higher than startTime and lower than closeTime.
I have never tried this code, but i think it is going to work.
Map map = {'1am': 1, '2am': 2, '3am': 3, ... , '11pm': 23};
map.entries.forEach((e) {
if(e.key == actualTime) {
if(e.value >= startTime && e.value < closeTime) {
print('Open');
}
else{
print('Closed');
}
}
});
By the way, I think you should use UTC, because if you change the time-zone on your device, your app is going to show that the shop is closed, but in fact the shop is open, just you are in a different time-zone. You can easily implement this with this code.
var now = DateTime.now().toUtc();
Maybe you can create a hash map like this:
hashMap=['12 AM', '1 AM', '2 AM', ... , '11 PM', '12 AM'];
After that you can get the positions of startTime, closeTime and actualTime, and see if the actualTime is between start and close times positions.
Let me know if you want to give you a code example.
I have #daily_schedule triggered daily at 3 minutes past 12am
When triggered by the scheduled tick at '2021-02-16 00:03:00'
The date input shows '2021-02-15 00:00:00', partition tagged as '2021-02-15'
While if triggered via backfill for partition '2021-02-16'
The date input shows '2021-02-16 00:00:00', partition tagged as '2021-02-16'
Why does the scheduled tick fill the partition a day before? Is there an option to use the datetime of execution instead (without using cron #schedule)? This descrepency is confusing when I perform queries using the timestamp for exact dates
P.S I have tested both scheduled run and backfil run to have the same Timezone.
#solid()
def test_solid(_, date):
_.log.info(f"Input date: {date}")
#pipeline()
def test_pipeline():
test_solid()
#daily_schedule(
pipeline_name="test_pipeline",
execution_timezone="Asia/Singapore",
start_date=START_DATE,
end_date=END_DATE,
execution_time=time(00, 03),
# should_execute=four_hourly_fitler
)
def test_schedule_daily(date):
timestamp = date.strftime("%Y-%m-%d %X")
return {
"solids": {
"test_solid":{
"inputs": {
"date":{
"value": timestamp
}
}
}
}
}
Sorry for the trouble here - the underlying assumption that the system is making here is that for schedules on pipelines that are partitioned by date, you don't fill in the partition for a day until that day has finished (i.e. the job filling in the data for 2/15 wouldn't run until the next day on 2/16). This is a common pattern in scheduled ETL jobs, but you're completely right that it's not a given that all schedules will want this behavior, and this is good feedback that we should make this use case easier.
It is possible to make a schedule for a partition in the way that you want, but it's more cumbersome. It would look something like this:
from dagster import PartitionSetDefinition, date_partition_range, create_offset_partition_selector
def partition_run_config(date):
timestamp = date.strftime("%Y-%m-%d %X")
return {
"solids": {
"test_solid":{
"inputs": {
"date":{
"value": timestamp
}
}
}
}
}
test_partition_set = PartitionSetDefinition(
name="test_partition_set",
pipeline_name="test_pipeline",
partition_fn=date_partition_range(start=START_DATE, end=END_DATE, inclusive=True, timezone="Asia/Singapore"),
run_config_fn_for_partition=partition_run_config,
)
test_schedule_daily = (
test_partition_set.create_schedule_definition(
"test_schedule_daily",
"3 0 * * *",
execution_timezone="Asia/Singapore",
partition_selector=create_offset_partition_selector(lambda d:d.subtract(minutes=3)),
)
)
This is pretty similar to #daily_schedule's implementation, it just uses a different function for mapping the schedule execution time to a partition (subtracting 3 minutes instead of 3 minutes and 1 day - that's the create_offset_partition_selector part).
I'll file an issue for an option to customize the mapping for the partitioned schedule decorators, but something like that may unblock you in the meantime. Thanks for the feedback!
Just an update on this: We added a 'partition_days_offset' parameter to the 'daily_schedule' decorator (and a similar parameter to the other schedule decorators) that lets you customize this behavior. The default is still to go back 1 day, but setting partition_days_offset=0 will give you the behavior you were hoping for where the execution day is the same as the partition day. This should be live in our next weekly release on 2/18.