Custom Operator States (queued, success, etc.) in Apache Airflow? - airflow

In Apache Airflow (2.x), each Operator Instance has a state as defined here (airflow source repo).
I have two use cases that don't seem to clearly fall into the pre-defined states:
Warn, but don't fail - This seems like it should be a very standard use case and I am surprised to not see it in the out-of-the-box airflow source code. Basically, I'd like to color-code a node with something eye-catching - say orange - corresponding to a non-fatal warning, but continue execution as normal otherwise. Obviously you can print warnings to the log, but finding them takes more work than just looking at the colorful circles on the DAGs page.
"Sensor N/A" or "Data not ready" - This would be a status that gets assigned when a sensor notices that data in the source system is not yet ready, and that downstream operators can be skipped until the next execution of the DAG, but that nothing in the data pipeline is broken. Basically an expected end-of-branch.
Is there a good way of achieving either of these use cases with the out-of-the-box Airflow node states? If not, is there a way to defining custom operator states? Since I am running airflow on a managed service (MWAA), I don't think changing the source code of our deployment is an option.
Thanks,

The task states are tightly integrated with Airflow. There's no way to configure which logging levels lead to which state. I'd say the easiest way is to grep log files for "WARNING" or set up a log aggregation service e.g. Elasticsearch to make log files searchable.
For #2, sensors have no knowledge about why a sensor timed out. After timeout or execution_timeout is reached, they simply raise an Exception. You can deal with exceptions using trigger_rules, but these still don't take the reason for an exception into account.
If you want more control over this, I would implement your own Sensor which takes an argument e.g. data_not_ready_timeout (which is smaller than timeout and execution_timeout). In the poke() method, check if data_not_ready_timeout has been reached, and raise an AirflowSkipException if so. This will skip downstream tasks. Once timeout or execution_timeout are reached, the task is failed. Look at BaseSensorOperator.execute() for some inspiration to get the initial starting date of a sensor.

Related

Finding out whether a DAG execution is a catchup or a regularly scheduled one

I have an Airflow pipeline that starts with a FileSensor that may perform a number of retries (which makes sense because the producing process sometimes takes longer, and sometimes simply fails).
However when I restart the pipeline, as it runs in catchup mode, the retries in the file_sensor become spurious: if the file isn't there for a previous day, it wont materialize anymore.
Therefore my question: is it possible to make the behavior of a DAG-run contingent on whether that is currently running in a catch up or in a regularly scheduled run?
My apologies if this is a duplicated question: it seems a rather basic problem, but I couldn't find previous questions or documentation.
The solution is rather simple.
Set a LatestOnlyOperator upstream from the FileSensor
Set an operator of any type you may need downstream from the FileSensor with its trigger rule set to TriggerRule.ALL_DONE.
Both skipped and success states count as "done" states, while an error state doesn't. Hence, in a "non-catch-up" run the FileSensor will need to succeed to give way to the downstream task, while in a catch-up run the downstream will right away start after skipping the FileSensor.

Airflow subdag shows success while its tasks are skipped

I have a subdag that uses a sensor operator which contains a soft_fail=true , in order to skip instead of failing the task.
It works well, except that the status of the whole subdag is shown as "succeeded" instead of "skipped" which could be misleading when monitoring the flow, as I wouldn't know if the file has been found, or simply skipped. Any thoughts on how to make the subdag status inherit the subtasks' status?
A "skipped" status isn't a failure though, you requested not to execute a task and it did just that. Also think about what the opposite would be, a user being surprised that their run had failed just because Airflow did as they asked and skipped all the tasks.
This issue regarding the skipped status has been covered before. For example, it was reported in 1.8.0 and fixed in 1.8.1, but in later versions this fix was not propagated.
You could open an issue and request the change by selecting Reference in new issue in the three dots of this link.

Can Airflow persist access to metadata of short-lived dynamically generated tasks?

I have a DAG that, whenever there are files detected by FileSensor, generates tasks for each file to (1) move the file to a staging area, (2) trigger a separate DAG to process the file.
FileSensor -> Move(File1) -> TriggerDAG(File1) -> Done
|-> Move(File2) -> TriggerDAG(File2) -^
In the DAG definition file, the middle tasks are generated by iterating over the directory that FileSensor is watching, a bit like this:
# def generate_move_task(f: Path) -> BashOperator
# def generate_dag_trigger(f: Path) -> TriggerDagRunOperator
with dag:
for filepath in Path(WATCH_DIR).glob(*):
sensor_task >> generate_move_task(filepath) >> generate_dag_trigger(filepath)
The Move task moves the files that lead to the task generation, so the next DAG run won't have FileSensor re-trigger either Move or TriggerDAG tasks for this file. In fact, the scheduler won't generate the tasks for this file at all, since after all files go through Move, the input directory has no contents to iterate over anymore..
This gives rise to two problems:
After execution, the task logs and renderings are no longer available. The Graph View only shows the DAG as it is now (empty), not as it was at runtime. (The Tree View shows that the tasks' run and state, but clicking on the "square" and picking any details leads to an Airflow error.)
The downstream tasks can be memory-holed due to a race condition. The first task is to move the originating file to a staging area. If that takes longer than the scheduler polling period, the scheduler no longer collects the downstream TriggerDAG(File1) task, which means that task is not scheduled to be executed even though the upstream task ran successfully. It's as if the downstream task never existed.
The race condition issue is solved by changing the task sequence to Copy(File1) -> TriggerDAG(File1) -> Remove(File1), but the broader problem remains: is there a way to persist dynamically generated tasks, or at least a way to consistently access them through the Airflow interface?
While it isn't clear, i'm assuming that downstream DAG(s) that you trigger via your orchestrator DAG are NOT dynamically generated for each file (like your Move & TriggerDAG tasks); in other words, unlike your Move tasks that keep appearing and disappearing (based on files), the downstream DAGs are static and stay there always
You've already built a relatively complex workflow that does advanced stuff like generating tasks dynamically and triggering external DAGs. I think with slight modification to your DAGs structure, you can get rid of your troubles (which also are quite advanced IMO)
Relocate the Move task(s) from your upstream orchestrator DAG to the downstream (per-file) process DAG(s)
Make the upstream orchestrator DAG do two things
Sense / wait for files to appear
For each file, trigger the downstream processing DAG (which in effect you are already doing).
For the orchestrator DAG, you can do it either ways
have a single task that does file sensing + triggering downstream DAGs for each file
have two tasks (I'd prefer this)
first task senses files and when they appear, publishes their list in an XCOM
second task reads that XCOM and foreach file, triggers it's corresponding DAG
but whatever way you choose, you'll have to replicate the relevant bits of code from
FileSensor (to be able to sense file and then publish their names in XCOM) and
TriggerDagRunOperator (so as to be able to trigger multiple DAGs with single task)
here's a diagram depicting the two tasks approach
The short answer to the title question is, as of Airflow 1.10.11, no, this doesn't seem possible as stated. To render DAG/task details, the Airflow webserver always consults the DAGs and tasks as they are currently defined and collected to DagBag. If the definition changes or disappears, tough luck. The dashboard just shows the log entries in the table; it doesn't probe the logs for prior logic (nor does it seem to store much of it other than the headline).
y2k-shubham provides an excellent solution to the unspoken question of "how can I write DAGs/tasks so that the transient metadata are accessible". The subtext of his solution: convert the transient metadata into something Airflow stores per task run, but keep the tasks themselves fixed. XCom is the solution he uses here, and it does shows up in the task instance details / logs.
Will Airflow implement persistent interface access to fleeting one-time tasks whose definition disappears from the DagBag? It's possible but unlikely, for two reasons:
It would require the webserver to probe the historical logs instead of just the current DagBag when rendering the dashboard, which would require extra infrastructure to keep the web interface snappy, and could make the display very confusing.
As y2k-shubham notes in a comment to another question of mine, fleeting and changing tasks/DAGs are an Airflow anti-pattern. I'd imagine that would make this a tough sell as the next feature.

Airflow Dependencies Blocking Task From Getting Scheduled

I have an airflow instance that had been running with no problem for 2 months until Sunday. There was a blackout in a system on which my airflow tasks depend and some tasks where queued for 2 days. After that we decided it was better to mark all the tasks for that day as failed and just lose that data.
Nevertheless, now all the new tasks get trigger at the proper time but they are never being set to any state (neither queued nor running). I check the logs and I see this output:
Dependencies Blocking Task From Getting Scheduled
All dependencies are met but the task instance is not running. In most cases this just means that the task will probably be scheduled soon unless:
The scheduler is down or under heavy load
The following configuration values may be limiting the number of queueable processes: parallelism, dag_concurrency, max_active_dag_runs_per_dag, non_pooled_task_slot_count
This task instance already ran and had its state changed manually (e.g. cleared in the UI)
I get the impression the 3rd topic is the reason why it is not working.
The scheduler and the webserver were working, however I restarted the scheduler and still I am having the same outcome. I also deleted the data in mysql database for one job and it is still not running.
I also saw a couple of post that said it is not running because the depens_on_past was set to true and if the previous runs failed, the next one will never be executed. I also checked it and it is not my case.
Any input would be really apreciated.
Any ideas? Thanks
While debugging a similar issue i found this setting: AIRFLOW__SCHEDULER__MAX_DAGRUNS_PER_LOOP_TO_SCHEDULE (or http://airflow.apache.org/docs/apache-airflow/2.0.1/configurations-ref.html#max-dagruns-per-loop-to-schedule), checking the airflow code it seems that the scheduler queries for dagruns to examine (consider to run ti's for), this query is limited to that number of rows (or 20 by default). So if you have >20 dagruns that are in some way blocked (in our case because ti's were on up-for-retry), then it won't consider other dagruns even though these could run fine.

Airflow Cluster Policy not taking effect

I'm attempting to use a Cluster Policy in Airflow 1.9. I followed the instructions in the official documentation, but it doesn't seem to be taking effect.
In my file at $AIRFLOW_HOME/config/airflow_local_settings.py, I've defined the method as the docs instructed and it has the following signature:
def policy(task_instance):
Additional concerns:
What Airflow component is actually running the policy code (is it the scheduler)?
Is there a recommended way to unit test cluster policy code? If not, what about local testing?
Can anyone help me understand why this Cluster Policy isn't taking effect?
I'm using Airflow 1.9.
So you seem to have the file in the right place according to the documents: https://github.com/apache/airflow/blob/master/docs/concepts.rst#where-to-put-airflow_local_settingspy
And your signature is right: https://airflow.apache.org/docs/stable/concepts.html#mutate-tasks-after-dag-loaded
But you haven't shown what you did and how that "did not work".
I believe the def policy(task): signature is run on the scheduler after DAG parsing (as the docs seem to say) while the def task_instance_mutation_hook(ti): signature is run by the task executor on the worker. That's probably why you're not seeing some changes take.
EG timeout or queue is something the scheduler enforces, but connection ID is something the worker needs to know during execution.
So if what you wanted to work was a timeout policy, it should have, but if what you wanted to work was a connection ID enforcement, that wouldn't have.

Resources