Airflow XCom: How to cast the byte array value into text or JSON text in SQL?

I'm investigating which data processing jobs are taking longer over time (for installations of our system that have been running for many months). The data files it processes vary in size by up to a few orders of magnitude, so I want to normalize the comparison between the processing times and the number of records in the payload, which is locked inside an XCom value.
I would like to build a SQL view that correlates processing duration (end minus start) with file size and execution date, to see how stable the processing is over its life cycle.
The documentation online has examples of serializing into JSON from Python, but our Airflow metadata store is Postgres, and I want to create a SQL view that joins the statistics from running the DAGs/tasks with the metadata from the processing itself, which is nested inside XCom values.
Does anyone know how to cast the XCom byte value into something parseable in Postgres SQL?

I'm facing the same issue. After digging through the Airflow source, I found this:
https://github.com/apache/airflow/blob/2bea3d74952d0d68d90e8bbc307ac3dfe8fcf2ff/airflow/models/xcom.py#L221
When setting an XCom variable, Airflow serializes the value before storing it in the database. In airflow.cfg there is a setting enable_xcom_pickling = True.
if conf.getboolean('core', 'enable_xcom_pickling'):
    return pickle.dumps(value)
So the byte array is a pickled value. This is annoying because I don't think there is a way to unpickle the byte array straight from Postgres.
There is also another flag you can set called donot_pickle = False. Not sure what this does yet; I'm still looking into it.
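If pickling is enabled, the practical workaround is to pull the raw bytes out of Postgres and unpickle them client-side rather than inside a SQL view. Here is a minimal sketch, assuming psycopg2 is installed, an older-style xcom table layout (dag_id, task_id, key, value), and a hypothetical XCom key record_count holding the payload size; adjust the connection string and columns to your installation. (If you can instead run with enable_xcom_pickling turned off, the values are stored as JSON-encoded bytes, which Postgres should be able to cast with something like convert_from(value, 'UTF8')::json.)

import pickle
import psycopg2

# Placeholder DSN; point this at your Airflow metadata database.
conn = psycopg2.connect("dbname=airflow user=airflow host=localhost")
with conn, conn.cursor() as cur:
    # Column names follow older Airflow schemas; newer versions key XCom rows
    # by run_id / dag_run_id, so check your schema first.
    cur.execute(
        "SELECT dag_id, task_id, value FROM xcom WHERE key = %s",
        ("record_count",),  # hypothetical XCom key holding the record count
    )
    for dag_id, task_id, raw in cur.fetchall():
        # value is a pickled byte string when enable_xcom_pickling = True
        record_count = pickle.loads(bytes(raw))
        print(dag_id, task_id, record_count)

From there you can join the unpickled counts against the start_date/end_date columns of the task_instance table in Python, instead of inside a Postgres view.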

Related

What is the best way to schedule tasks in a serverless stack?

I am using NextJS and Firebase for an application. Users are able to rent products for a certain period. After that period, a serverless function should be triggered which updates the database, etc. Since NextJS is event-driven, I cannot seem to figure out how to schedule a task that executes when the rental period ends and updates the database.
Perhaps cron jobs handled elsewhere (Easy Cron etc) are a solution. Or maybe an EC2 instance just for scheduling these tasks.
Since this is tagged with AWS EC2, I've assumed it's OK to suggest a solution with AWS services in mind.
What you could do is leverage DynamoDB's speed and sort capabilities. If you define a table with both a partition key and a range (sort) key, the items are automatically sorted by the range key in UTF-8 order. This means ISO-timestamp values can be used to sort data chronologically.
With this in mind, you could design your table with a partition key holding a single constant value across all users (to group them all) and a sort key of isoDate#userId, while also creating a GSI (Global Secondary Index) with the userId as the partition key and the isoDate as the range key.
With your data sorted, you can use a BETWEEN query to extract the entries that fall inside your time window.
Schedule one Lambda to run every minute (or so) and extract the entries that are about to expire, to notify the users about it.
Important note: this sorting method works only when ALL range keys have the same length, due to how UTF-8 sorting works. You can easily accomplish this if your application uses UUIDs as ids. If not, you can simply generate a random UUID to attach to the ISO timestamp, as you only need it to avoid the rare exact-time collision.
Example: let's say you want to extract all data expiring near the 2022-10-10T12:00:00.000Z hour:
your query would be BETWEEN 2022-10-10T11:59:00.000Z#00000000-0000-0000-0000-000000000000 and 2022-10-10T12:00:59.999Z#zzzzzzzz-zzzz-zzzz-zzzz-zzzzzzzzzzzz
The timestamps could be a little off, but you get the idea: 00... is the lowest UTF-8 value of a UUID and zz... (or ff...) is the highest.
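For illustration, a minimal boto3 sketch of that BETWEEN query; the table name "rentals", the attribute names "pk"/"sk", and the constant partition value "RENTAL" are assumptions, not anything fixed by the answer above.

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("rentals")  # hypothetical table name

# Window from the example above: everything expiring around 12:00 UTC.
low = "2022-10-10T11:59:00.000Z#00000000-0000-0000-0000-000000000000"
high = "2022-10-10T12:00:59.999Z#zzzzzzzz-zzzz-zzzz-zzzz-zzzzzzzzzzzz"

resp = table.query(
    KeyConditionExpression=Key("pk").eq("RENTAL") & Key("sk").between(low, high)
)
for item in resp["Items"]:
    print(item)  # e.g. notify the user that this rental is about to expire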
In AWS, creating periodic triggers for Lambda using the AWS Console is quite simple and straightforward.
Log in to the console and navigate to CloudWatch.
Under Events, select Rules and click “Create Rule”.
You can either select a fixed rate or select a cron expression for more control.
Cron expressions in CloudWatch start from minutes, not seconds; this is important to remember if you are copying a cron expression from somewhere else.
Click “Add Target”, select “Lambda Function” from the drop-down, and then select the appropriate Lambda function.
If you want to pass some data to the target function when triggered, you can do so by expanding “Configure Input”.
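The same setup can be scripted instead of clicked through; here is a hedged boto3 sketch, where the rule name, function name, and ARN are placeholders.

import boto3

events = boto3.client("events")
lambda_client = boto3.client("lambda")

# Create (or update) a rule that fires every minute; a cron() expression works too.
rule = events.put_rule(
    Name="expire-rentals-every-minute",
    ScheduleExpression="rate(1 minute)",
)

# Point the rule at the Lambda function and optionally pass static input data.
events.put_targets(
    Rule="expire-rentals-every-minute",
    Targets=[{
        "Id": "expire-rentals",
        "Arn": "arn:aws:lambda:eu-west-1:123456789012:function:expire-rentals",  # placeholder ARN
        "Input": '{"source": "scheduled-rule"}',
    }],
)

# The function also needs permission to be invoked by the rule.
lambda_client.add_permission(
    FunctionName="expire-rentals",
    StatementId="allow-eventbridge-schedule",
    Action="lambda:InvokeFunction",
    Principal="events.amazonaws.com",
    SourceArn=rule["RuleArn"],
)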

How to periodically update a moderate amount of data (~2.5m entries) in Google Datastore?

I'm trying to do the following periodically (let's say once a week):
download a couple of public datasets
merge them together, resulting in a dictionary (I'm using Python) of ~2.5m entries
upload/synchronize the result to Cloud Datastore so that I have it as "reference data" for other things running in the project
Synchronization can mean that some entries are updated, others are deleted (if they were removed from the public datasets) or new entries are created.
I've put together a Python script using google-cloud-datastore, however the performance is abysmal - it takes around 10 hours (!) to do this. What I'm doing:
iterate over the entries from the datastore
look them up in my dictionary and decide if they need an update or a delete (if they are no longer present in the dictionary)
write them back / delete them as needed
insert any new elements from the dictionary
I already batch the requests (using .put_multi, .delete_multi, etc).
Some things I considered:
Use Dataflow. The problem is that each task would have to load the dataset (my "dictionary") into memory, which is time- and memory-consuming.
Use the managed import/export. The problem is that it produces/consumes an undocumented binary format (I would guess entities serialized as protocol buffers?).
Use multiple threads locally to mitigate the latency. The problem is that the google-cloud-datastore library has limited support for cursors (it doesn't have an "advance cursor by X" method, for example), so I don't have a way to efficiently divide the entities from Datastore into chunks that could be processed by different threads.
How could I improve the performance?
Assuming that your datastore entities are only updated during the sync, then you should be able to eliminate the "iterate over the entries from the datastore" step and instead store the entity keys directly in your dictionary. Then if there are any updates or deletes necessary, just reference the appropriate entity key stored in the dictionary.
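A minimal sketch of that keyed approach with google-cloud-datastore; the kind name "Reference" is made up, and the dictionary is assumed to map an id to a dict of properties.

from google.cloud import datastore

client = datastore.Client()

def sync(merged: dict, previous_ids: set):
    # Upsert everything in merged; delete ids that disappeared since the last run.
    to_put = []
    for record_id, attrs in merged.items():
        entity = datastore.Entity(key=client.key("Reference", record_id))
        entity.update(attrs)
        to_put.append(entity)

    to_delete = [client.key("Reference", rid) for rid in previous_ids - merged.keys()]

    # Datastore caps a commit at 500 mutations, so chunk the batched calls.
    for i in range(0, len(to_put), 500):
        client.put_multi(to_put[i:i + 500])
    for i in range(0, len(to_delete), 500):
        client.delete_multi(to_delete[i:i + 500])

This skips the full read pass entirely: the only Datastore operations left are the batched writes and deletes.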
You might be able to leverage multiple threads if you pre-generate empty entities (or keys) in advance and store cursors at a given interval (say every 100,000 entities). There's probably some overhead involved as you'll have to build a custom system to manage and track those cursors.
If you use Dataflow, instead of loading your entire dictionary you could first import the dictionary into a new project (a clean Datastore database); then, in your Dataflow function, you could look up the key given to you through Dataflow in the clean project. If a value comes back from the lookup, upsert it into your production project; if it doesn't exist, delete the value from your production project.

What happens when the 5 second execution time limit is exceeded in Azure DocumentDb Stored Procedures

I have a read operation that reads a lot of records from a DocumentDb collection, and when executed it runs for a long time. I am writing a stored procedure to move that query to the server side. I understand that DocumentDb stored procedures have an execution cap of 5 seconds. What I want to know is: in a read operation, what happens when the query execution hits that time limit? Can I add some kind of retry logic to continue after some time, or will I have to do the read from the beginning?
This is not a problem if you follow this simple pattern when writing your stored procedures and you keep calling the stored procedure until continuation comes back null.
The key help here is that you are given some buffer beyond the 5 seconds to wrap up your stored procedure before it's forcibly shut down. Whenever the sproc is about to be shut down, the most recent database operation will return false instead of true. DocumentDB gives you enough time to process the last batch returned.
For read/query operations (for example, countDocuments), the key element of the recommended pattern is to store the continuation token for your read/query operation in the body that's returned from your stored procedure. You can set the body as many times as you want; only the last one will be returned, either when the stored procedure exits gracefully because resource limits are reached or when the stored procedure's job is done.
For write operations (for example, createVariedDocuments), documentdb-utils still looks at the continuation that's returned to decide whether the sproc has finished its work, except in this case it won't be a read/query continuation and its value doesn't matter. It's simply an indicator of whether or not you need to call the sproc again. That's why I set it to "Value does not matter" in my example; anything other than null would work.
Key off of the continuation that's returned from the stored procedure execution to decide whether or not to call it again. Documentdb-utils will automatically keep calling your stored procedure until continuation comes back null but you can implement this yourself. Documentdb-utils also includes a number of example sprocs that implement this pattern for you to riff off of. Documentdb-lumenize utilizes this pattern to the nth degree to implement an aggregation engine running inside of a sproc.
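Client-side, the retry loop can look something like the sketch below, shown with the azure-cosmos Python SDK; the sproc name countDocuments and the shape of its response body ({"count": ..., "continuation": ...}) are assumptions based on the pattern described above, not part of the library.

from azure.cosmos import CosmosClient

# Placeholder endpoint, key, database, and container names.
client = CosmosClient("https://myaccount.documents.azure.com", credential="<key>")
container = client.get_database_client("mydb").get_container_client("mycoll")

continuation = None
total = 0
while True:
    body = container.scripts.execute_stored_procedure(
        sproc="countDocuments",
        params=[continuation],           # pass the last continuation back in
        partition_key="some-partition",  # placeholder partition key value
    )
    total += body["count"]
    continuation = body["continuation"]
    if continuation is None:  # the sproc signals it is done with a null continuation
        break
print(total)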
Disclosure: I'm the author of documentdb-utils and documentdb-lumenize.

Meta-data from SQLite

Is there any way to query a SQLite database for basic meta data such as:
Last date/time updated
Hash of database to indicate "state"
I am just looking for a simple, infrastructural way to have a script evaluate different databases and take a reasonable point of view on whether they are the same "state" as other databases in a different environment (PROD and DEV for instance).
In my experience, if no update, new record, or any other change is made to the SQLite database file, the file's last-modified time doesn't change. So the last-modified time should suffice as the time of the most recent change made to the database.
If two database files with the same state are only accessed for reading, their modified times stay the same.
Similarly, you can compare the file sizes.
You can hash the whole file to compare state. If you consider the same data in the database to be the same "state" regardless of any differences in its history, then you probably want a hash of all the records in the database, which is probably not simple.
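A small Python sketch of those file-level checks (last-modified time, size, and a whole-file hash); the file names are placeholders.

import hashlib
import os

def db_fingerprint(path: str) -> dict:
    stat = os.stat(path)
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return {
        "mtime": stat.st_mtime,        # last modified time of the database file
        "size": stat.st_size,          # file size in bytes
        "sha256": digest.hexdigest(),  # hash of the whole file
    }

# Compare a PROD copy against a DEV copy, ignoring the timestamp.
prod, dev = db_fingerprint("prod.sqlite3"), db_fingerprint("dev.sqlite3")
print(prod["size"] == dev["size"] and prod["sha256"] == dev["sha256"])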

Is it possible and meaningful to execute an Array DML command containing BLOB data?

Is it possible to execute an Array DML INSERT or UPDATE statement passing BLOB field data in the parameter array? And the more important part of my question: if it is possible, will an Array DML command containing BLOB data still be more efficient than executing the commands one by one?
I have noticed that TADParam has an AsBlobs indexed property, so I assume it might be possible, but I haven't tried it yet because there's no mention of performance nor an example showing this, and because the indexed property is of type RawByteString, which is not very suitable for my needs.
I'm using FireDAC and working with a SQLite database (Params.BindMode = pbByNumber, so I'm using a native SQLite INSERT with multiple VALUES). My aim is to store about 100k records containing pretty small BLOB data (about 1 kB per record) as fast as possible (at the cost of FireDAC's abstraction).
The main point in your case is that you are using a SQLite3 database.
With SQLite3, Array DML is "emulated" by FireDAC. Since it is a local instance, not a client-server instance, there is no need to prepare a bunch of rows, then send them at once to avoid network latency (as with Oracle or MS SQL).
Using Array DML may speed up your insertion process a little with SQLite3, but I doubt the gain will be very high. A good plain INSERT with binding by number will work just fine.
The main tips about performance in your case will be:
Nest your process within a single transaction (or even better, use one transaction per 1000 rows of data);
Prepare an INSERT statement, then re-execute it with a bound parameter each time;
By default, FireDAC initializes SQLite3 with the fastest options (e.g. disabling LOCK), so let it be.
SQLite3 is very good at handling BLOBs.
From my tests, FireDAC insertion timing is pretty good, very close to direct SQLite3 access. Only reading is slower than a direct SQLite3 link, due to the overhead of the Delphi TDataSet class.
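To make the pattern concrete, here is the same idea (a single transaction around a prepared INSERT that is re-executed with bound parameters per row) illustrated with Python's built-in sqlite3 module rather than FireDAC/Delphi; the table and column names are made up.

import os
import sqlite3

conn = sqlite3.connect("blobs.db")
conn.execute("CREATE TABLE IF NOT EXISTS records (id INTEGER PRIMARY KEY, payload BLOB)")

# ~100k rows of roughly 1 kB of BLOB data each, as in the question.
rows = ((i, os.urandom(1024)) for i in range(100_000))

with conn:  # one transaction around the whole batch
    # The statement is prepared once and re-executed with bound parameters per row.
    conn.executemany("INSERT INTO records (id, payload) VALUES (?, ?)", rows)

conn.close()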
