I have a SubDAG that uses a sensor operator with soft_fail=True, so that the task is skipped instead of failed.
It works well, except that the status of the whole SubDAG is shown as "success" instead of "skipped", which can be misleading when monitoring the flow: I can't tell whether the file was found or simply skipped. Any thoughts on how to make the SubDAG status inherit the subtasks' status?
A "skipped" status isn't a failure, though: you asked Airflow not to execute a task, and it did just that. Also consider the opposite case: a user surprised that their run failed just because Airflow did as it was asked and skipped all the tasks.
This issue with the skipped status has come up before. For example, it was reported in 1.8.0 and fixed in 1.8.1, but the fix was not carried forward to later versions.
You could open an issue and request the change by selecting "Reference in new issue" from the three-dots menu of this link.
I have an Airflow pipeline that starts with a FileSensor that may perform a number of retries (which makes sense because the producing process sometimes takes longer, and sometimes simply fails).
However, when I restart the pipeline, since it runs in catchup mode, the retries in the FileSensor become spurious: if the file wasn't there for a previous day, it won't materialize anymore.
Hence my question: is it possible to make the behavior of a DAG run contingent on whether it is currently a catch-up run or a regularly scheduled run?
My apologies if this is a duplicate question: it seems like a rather basic problem, but I couldn't find previous questions or documentation on it.
The solution is rather simple.
Set a LatestOnlyOperator upstream from the FileSensor.
Set an operator of any type you need downstream from the FileSensor, with its trigger rule set to TriggerRule.ALL_DONE.
Both the skipped and success states count as "done" states, while a failed state doesn't. Hence, in a non-catch-up run the FileSensor must succeed to give way to the downstream task, while in a catch-up run the LatestOnlyOperator skips the FileSensor and the downstream task starts right away.
Any idea why this would occur, and how I would go about troubleshooting it?
--
Update:
It's "none status", not "queued" as I originally interpreted.
The DAG run occurred on 3/8 and the last relevant commit was on 3/1. But I'm having trouble finding the same DAG run... will keep investigating.
It's not Queued status. It's None status.
This can happen in one of the following cases:
The task drop_staging_table_if_exists was added after create_staging_table started to run.
The task drop_staging_table_if_exists used to have a different task_id in the past.
The task drop_staging_table_if_exists was somewhere else in the workflow and you changed the dependencies after the DAG run started.
Note that Airflow currently doesn't support DAG versioning (it will be supported in a future version, once AIP-36 DAG Versioning is completed). This means that Airflow constantly reloads the DAG structure, so changes you make are also reflected on past runs. This is by design, and it's very useful for cases where you want to backfill past runs.
Either way, if you start a new run or clear this specific run, the issue you are facing will be resolved.
In Apache Airflow (2.x), each Operator Instance has a state as defined here (airflow source repo).
I have two use cases that don't seem to clearly fall into the pre-defined states:
Warn, but don't fail - This seems like it should be a very standard use case and I am surprised to not see it in the out-of-the-box airflow source code. Basically, I'd like to color-code a node with something eye-catching - say orange - corresponding to a non-fatal warning, but continue execution as normal otherwise. Obviously you can print warnings to the log, but finding them takes more work than just looking at the colorful circles on the DAGs page.
"Sensor N/A" or "Data not ready" - This would be a status that gets assigned when a sensor notices that data in the source system is not yet ready, and that downstream operators can be skipped until the next execution of the DAG, but that nothing in the data pipeline is broken. Basically an expected end-of-branch.
Is there a good way of achieving either of these use cases with the out-of-the-box Airflow node states? If not, is there a way of defining custom operator states? Since I am running Airflow on a managed service (MWAA), I don't think changing the source code of our deployment is an option.
Thanks,
The task states are tightly integrated with Airflow; there's no way to configure which logging levels lead to which state. I'd say the easiest way is to grep the log files for "WARNING", or to set up a log aggregation service, e.g. Elasticsearch, to make the log files searchable.
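For example, if you have filesystem access to the logs (which is not the case on MWAA without exporting them first), a quick way to surface warnings is a recursive grep. The log directory below assumes the default AIRFLOW_HOME layout:

```shell
# Search task logs for warning lines; the path assumes the default
# AIRFLOW_HOME layout and may differ in your deployment.
LOG_DIR="${AIRFLOW_HOME:-$HOME/airflow}/logs"
grep -rn "WARNING" "$LOG_DIR" 2>/dev/null | head
```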
For #2, sensors have no knowledge of why they timed out. Once timeout or execution_timeout is reached, they simply raise an exception. You can deal with exceptions using trigger_rules, but these still don't take the reason for the exception into account.
If you want more control over this, you could implement your own sensor that takes an argument, e.g. data_not_ready_timeout (smaller than timeout and execution_timeout). In the poke() method, check whether data_not_ready_timeout has been reached and raise an AirflowSkipException if so; this will skip the task and its downstream tasks. Once timeout or execution_timeout is reached, the task is failed as usual. Look at BaseSensorOperator.execute() for inspiration on how to get the initial starting date of a sensor.
I have an Airflow instance that had been running with no problems for two months, until Sunday. There was a blackout in a system my Airflow tasks depend on, and some tasks were queued for two days. After that we decided it was better to mark all the tasks for that day as failed and just lose that data.
Nevertheless, now all the new tasks get triggered at the proper time, but they are never set to any state (neither queued nor running). I checked the logs and I see this output:
Dependencies Blocking Task From Getting Scheduled
All dependencies are met but the task instance is not running. In most cases this just means that the task will probably be scheduled soon unless:
The scheduler is down or under heavy load
The following configuration values may be limiting the number of queueable processes: parallelism, dag_concurrency, max_active_dag_runs_per_dag, non_pooled_task_slot_count
This task instance already ran and had its state changed manually (e.g. cleared in the UI)
I get the impression the third reason is why it is not working.
The scheduler and the webserver were working; however, I restarted the scheduler and still get the same outcome. I also deleted the data in the MySQL database for one job, and it is still not running.
I also saw a couple of posts saying that a task won't run when depends_on_past is set to true and a previous run failed, so the next one is never executed. I checked that too, and it is not my case.
Any input would be really appreciated.
Any ideas? Thanks
While debugging a similar issue I found the setting AIRFLOW__SCHEDULER__MAX_DAGRUNS_PER_LOOP_TO_SCHEDULE (see http://airflow.apache.org/docs/apache-airflow/2.0.1/configurations-ref.html#max-dagruns-per-loop-to-schedule). Checking the Airflow code, the scheduler queries for DAG runs to examine (i.e. to consider running task instances for), and that query is limited to this number of rows (20 by default). So if you have more than 20 DAG runs that are blocked in some way (in our case because task instances were in up_for_retry), the scheduler won't consider other DAG runs, even though those could run fine.
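If this limit turns out to be the bottleneck, it can be raised through the environment-variable form of the setting; the value 100 below is an arbitrary illustration, not a recommendation:

```shell
# Raise the number of DAG runs the scheduler examines per loop
# (the default is 20); 100 is an arbitrary example value.
# Restart the scheduler for the change to take effect.
export AIRFLOW__SCHEDULER__MAX_DAGRUNS_PER_LOOP_TO_SCHEDULE=100
```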
A similar question has been asked before, but as far as I can tell it can't be explained by that answer, since I don't have any tasks with the "all_done" trigger rule.
As seen in the image above, a task has failed but the DAG is still marked as a success. The trigger rule for all tasks but one is the default.
Any idea how to mark the whole DAG as failed in such a case?