Multiple Airflow XCOMs

I am pushing multiple values to XCom based on values returned from a database. Because the number of values returned may vary, I am using the index as the key.
How do I retrieve all of those values from the previous task in the next task? Currently I am only getting the last XCom from t1, but I would like all of them.

Here is the source code for xcom_pull.
You'll see it has some filter logic and defaulting behaviour. I believe you are doing xcom_pull()[-1] or the equivalent. You can pass the task_ids argument a list of the explicit task IDs you want to pull XCom data from, in order. Alternatively, you can use the keys that you pushed the data up with.
So in your case, where you want all the data emitted from the last task instance and that alone, you just need to pass the task_id of the relevant task to the xcom_pull method (and, since you pushed under multiple keys, pull each of those keys).
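A minimal sketch of that pattern, assuming the pushing task is named t1 and each value is pushed under its index as the key (the count key and sample values are illustrative, not from the original post):

def push_values(**context):
    rows = ["a", "b", "c"]  # stand-in for the values returned from the database
    for i, row in enumerate(rows):
        context["ti"].xcom_push(key=str(i), value=row)
    # Push the count as well so the downstream task knows how many keys to pull.
    context["ti"].xcom_push(key="count", value=len(rows))

def pull_values(**context):
    ti = context["ti"]
    count = ti.xcom_pull(task_ids="t1", key="count")
    values = [ti.xcom_pull(task_ids="t1", key=str(i)) for i in range(count)]
    print(values)

Both callables would be wired into PythonOperators (t1 and the downstream task) in the usual way.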

Related

Airflow xcom returns a string in list format instead of just a string value?

I have an Airflow operator that returns a string value, and the task is named 'task1'. So after execution I go into XCom and check the return_value, and it's just a string (screenshot below).
[xcom key/value screenshot]
Then I have a following operator, named task2, that takes an input from the XCom value of task1 like below:
"{{ ti.xcom_pull(task_ids=['task1'],key='return_value')}}"
The problem is that the value it gets is a list converted to a string.
Value in XCom: this is just a string
Value returned by xcom_pull (Jinja template version): ['this is just a string']
So is there a way I can update the xcom_pull (Jinja version) shown above to pull just the value? I don't have access inside the operator it's being passed into, otherwise I could put in some logic to convert the string back to a list and take the value only (but that would not be ideal and is not an option anyway).
Also, I think it's worth mentioning that I tried to do something similar using the PythonOperator, doing the xcom_pull inside the Python code, and the value was returned just fine. So I'm not sure why the xcom_pull using Jinja templating behaves this way, or how I can get around it. I'm hoping there is something I'm not aware of that will easily get the output I want. The PythonOperator code that works is below (just as an FYI):
import logging

def python_code_task3(**context):
    value = context['ti'].xcom_pull(task_ids='task1', key='return_value')
    logging.info("Value: " + value)
And this code outputs the value just as I want: this is just a string
I really just want to use the Jinja template version and have it retrieve and pass in the string, not a string representation of a list with the string value as its one item.
There is a slight difference between the two ways you are pulling the XCom in your code snippets: one has task_ids=['task1'] (a list argument) while the other has task_ids='task1' (a str argument).
The argument type of task_ids matters when using xcom_pull(). If you pass a list of task IDs, Airflow infers that there are multiple tasks to pull XComs from and returns a list containing all of the retrieved XComs. Otherwise, if the type is simply a string, i.e. a single task ID, a single XCom value is returned. Here is a link to the code where this is done.
It's also worth noting that Jinja-templated values are rendered as strings by default. However, as of Airflow 2.1 you can set a parameter called render_template_as_native_obj to True at the DAG level. This renders Jinja-templated values as native Python objects (list, dict, etc.) when applicable. More info on this concept here.
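A small sketch of both options (the dag_id and start_date are illustrative, not from the original post):

from datetime import datetime
from airflow import DAG

# Option 1: pass task_ids as a plain string so xcom_pull returns the single
# value rather than a one-element list.
pull_single = "{{ ti.xcom_pull(task_ids='task1', key='return_value') }}"

# Option 2: if you genuinely need native Python objects (lists, dicts) from
# templated fields, enable native rendering at the DAG level (Airflow >= 2.1).
dag = DAG(
    dag_id="example_native_rendering",
    start_date=datetime(2021, 1, 1),
    render_template_as_native_obj=True,
)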

Parsing nested JSON data within a Kusto column

After parsing the JSON data in a column within my Kusto Cluster using parse_json, I'm noticing there is still more data in JSON format nested within the resulting projected value. I need to access that information and make every piece of the JSON data its own column.
I've attempted to follow the answer from this SO post (Parsing json in kusto query) but haven't been successful in getting the syntax correct.
myTable
| project
Time,
myColumnParsedJSON = parse_json(column)
| project myColumnParsedNestedJSON = parse_json(myColumnParsedJSON.nestedJSONDataKey)
I expect the results to be projected columns, each named as each of the keys, with their respective values displayed in one row record.
Please see the note at the bottom of this doc:
It is somewhat common to have a JSON string describing a property bag in which one of the "slots" is another JSON string. In such cases, it is not only necessary to invoke parse_json twice, but also to make sure that in the second call, tostring will be used. Otherwise, the second call to parse_json will simply pass on the input to the output as-is, because its declared type is dynamic.
Once you're able to get parse_json to properly parse your payload, you can use the bag_unpack plugin (doc) in order to achieve the requirement you mentioned:
I expect the results to be projected columns, each named as each of the keys, with their respective values displayed in one row record.
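Putting the two notes together, a sketch along these lines may do it (the table, column, and key names are taken from the question; the exact shape of the payload is assumed):

myTable
| project Time, myColumnParsedJSON = parse_json(column)
| extend nestedJSON = parse_json(tostring(myColumnParsedJSON.nestedJSONDataKey))
| evaluate bag_unpack(nestedJSON)

The tostring() call forces the second parse_json() to actually re-parse the nested string, and bag_unpack() then expands each key of the resulting property bag into its own column.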

Automatic rollback of implicit transaction(s) for multiple statements?

When multiple statements are submitted together (separated by semicolons but in the same string) and are NOT wrapped in an explicit transaction, is only a single implicit transaction created, or is an implicit transaction created for each statement separately? Further, if one of the later statements fails and an automatic rollback is performed, are all of the statements rolled back?
This other answer almost satisfies my question, but the wording in the official documentation leaves me puzzled. In fact, this may seem like a duplicate, but I am specifically asking about implicit transactions for multiple statements; the other answer does not explicitly address this particular case.
As an example (borrowing from the other question), the following statements are submitted as a single string:
INSERT INTO a (x, y) VALUES (0, 0);
INSERT INTO b (x, y) VALUES (1, 2); -- error here, b doesn't have column x
The documentation says
Automatically started transactions are committed when the last query finishes. (emphasis added)
and
An implicit transaction (a transaction that is started automatically, not a transaction started by BEGIN) is committed automatically when the last active statement finishes. A statement finishes when its prepared statement is reset or finalized. (emphasis added)
The keyword last implies to me the possibility of multiple statements. Of course, if an implicit transaction is started for each individual statement, then each statement, taken individually, is the "last" statement to be executed; but if the intended context is one single statement at a time, the documentation should just say the statement.
Or is there a difference between prepared statements and unprepared SQL strings? (As I understand it, all statements are prepared even if the calling application doesn't preserve the prepared statement for reuse, so I'm not sure this even matters.)
In the case where all statements succeed, the result of a single commit or of multiple commits is essentially the same, but the docs only mention that the single failing statement is automatically rolled back; they don't mention the other statements submitted together.
The sqlite3_prepare interface compiles the first SQL statement in a query string. The pzTail parameter to these functions returns a pointer to the beginning of the unused portion of the query string.
For example, if you call sqlite3_prepare with the multi-statement SQL string in your example, the first statement is the only one that is active for the resulting prepared statement. The pzTail pointer, if provided, points to the beginning of the second statement. The second statement is not compiled as a prepared statement until you call sqlite3_prepare again with the pzTail pointer.
So, no, multiple statements are not rolled back. Each implicit transaction created by the SQLite engine encompasses a single prepared statement.
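Python's built-in sqlite3 module makes this easy to observe: executescript() commits any pending transaction and then runs the multi-statement string, so (absent an explicit BEGIN in the script) each statement is committed by its own implicit transaction. A rough sketch, with a schema assumed to mirror the example above:

import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE a (x, y)")
con.execute("CREATE TABLE b (y)")  # note: b has no column x

try:
    # Two statements in one string, no explicit BEGIN/COMMIT.
    con.executescript("""
        INSERT INTO a (x, y) VALUES (0, 0);
        INSERT INTO b (x, y) VALUES (1, 2);  -- fails: b doesn't have column x
    """)
except sqlite3.OperationalError as exc:
    print("second statement failed:", exc)

# The first INSERT was committed by its own implicit transaction, so the
# failure of the second statement does not roll it back.
print(con.execute("SELECT * FROM a").fetchall())  # expected: [(0, 0)]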

Airflow - Variables among tasks

How do I create a variable at the DAG level and pass it on to multiple tasks?
For example:
cluster_name = 'data-' + datetime.now().strftime("%Y-%m-%d-%H-%M-%S-%f")
I have to use the above variable cluster_name in all tasks, but I see the value keeps changing. I do not want to use XCom. Please advise.
This value will change all the time because the DAG definition is being parsed repeatedly by the scheduler/webserver/workers, and datetime.now() will return different values every time it is parsed.
I highly recommend against using dynamic task names.
The date is already part of a task in the sense that the execution date is part of what makes each run of the task unique.
Each task instance can be identified by: dag_id + task_id + execution_date
To uniquely identify the tasks, use these things instead of bundling the date inside the name.
You can store it in an Airflow Variable and it should be accessible to all your tasks. Just note that it is a database call each time you look up a Variable.
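A hedged sketch of that approach (the Variable name and the callable are illustrative, not from the original post): set the value once, outside of the repeatedly parsed DAG-definition code, and read it from every task that needs it.

from airflow.models import Variable

def create_cluster(**context):
    # Each Variable.get() is a call to the metadata database.
    cluster_name = Variable.get("cluster_name")
    print(f"creating cluster {cluster_name}")

# In templated operator arguments the same Variable is available as:
#   "{{ var.value.cluster_name }}"
# If all you need is a per-run suffix, the execution date is already unique
# per DAG run and can be templated directly, e.g. "data-{{ ds_nodash }}".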

DynamoDB - Return which ConditionExpression was false

I am calling PutItem with a ConditionExpression that looks like:
attribute_exists(id) AND object_version = :x
In other words, I only want to update an item if the following conditions are true:
The object needs to exist
My update must be on the latest version of the object
Right now, if the check fails, I don't know which condition was false. Is there a way to get information on which conditions were false? Probably not, but who knows...
Conditional expressions in DynamoDB allow for atomic write operations that are strongly consistent for a single item, even in a distributed system, thanks to Paxos.
One standard approach is to simply read the object first and perform the above checks in your client application code. If one of the conditions doesn't match, you know directly which one was invalid, without a failed write operation. The reason for having DynamoDB also perform the check is that another application or thread may have modified the object between your check and your write; if the write fails, you read the object again and repeat the check.
Another approach is to skip the read before the write and just read the object after the failed write, doing the check in your code to determine which condition actually failed.
The one or two additional reads against the table are required because you want to know which specific condition failed; DynamoDB does not provide that information, so you have to do the check yourself.
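A hedged boto3 sketch of that second approach (the table name, key, and attribute names are illustrative): attempt the conditional write, and only after a ConditionalCheckFailedException read the item back to work out which condition was violated.

import boto3
from botocore.exceptions import ClientError

table = boto3.resource("dynamodb").Table("my_table")

def put_with_diagnostics(item, expected_version):
    try:
        table.put_item(
            Item=item,
            ConditionExpression="attribute_exists(id) AND object_version = :x",
            ExpressionAttributeValues={":x": expected_version},
        )
    except ClientError as err:
        if err.response["Error"]["Code"] != "ConditionalCheckFailedException":
            raise
        # The write was rejected; one extra read tells us which condition failed.
        current = table.get_item(Key={"id": item["id"]}).get("Item")
        if current is None:
            raise RuntimeError("item does not exist yet")
        if current.get("object_version") != expected_version:
            raise RuntimeError(
                f"stale write: item is at version {current['object_version']}, "
                f"expected {expected_version}"
            )
        raise  # both conditions now look satisfied; another writer likely intervened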
