I'm running Airflow on a 4 CPU machine with the LocalExecutor.
I've defined a task with trigger_rule='one_success', so it should start as soon as any one of its upstream tasks succeeds:
create_spark_cluster_task = BashOperator(
    task_id='create_spark_cluster',
    trigger_rule='one_success',
    bash_command= ...,
    dag=dag)
...
download_bag_data_task >> create_spark_cluster_task
download_google_places_data_task >> create_spark_cluster_task
download_facebook_places_details_data_task >> create_spark_cluster_task
download_facebook_places_details_data_task_2 >> create_spark_cluster_task
download_facebook_places_details_data_task_3 >> create_spark_cluster_task
download_factual_data_task >> create_spark_cluster_task
download_dataoutlet_data_task >> create_spark_cluster_task
But even though some of the upstream tasks are clearly marked as success, the task does not trigger.
The 'download' tasks do run in parallel, so that cannot be the issue.
Inspecting the tasks shows:
Dependency: Unknown
Reason: All dependencies are met but the task
instance is not running. In most cases this just means that the task
will probably be scheduled soon unless:
- The scheduler is down or under heavy load
- This task instance already ran and had its state changed manually (e.g. cleared in the UI)
I've looked at the load and it's indeed pretty high:
load average: 2.45, 3.55, 3.71
CPU is at 50-60%
But other tasks have already finished, so there should be resources free to start another task, right?
Airflow: 2.1.2 - Executor: KubernetesExecutor - Python: 3.7
I have written tasks using the Airflow 2+ TaskFlow API and am running Airflow with the KubernetesExecutor. There are success and failure callbacks on the task, but sometimes they get missed.
I've tried to specify the callbacks both via default_args on the DAG and directly in the task decorator, but I'm seeing the same behaviour.
@task(
    on_success_callback=common.on_success_callback,
    on_failure_callback=common.on_failure_callback,
)
def delta_load_pstn(files):
    # doing something here
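For reference, the default_args variant I tried looks roughly like this (a minimal sketch; the dag_id, start_date and schedule are placeholders, and common.* are the same callback helpers as above):

import pendulum
from airflow.decorators import dag, task

import common  # module holding the on_success / on_failure callback helpers

default_args = {
    "on_success_callback": common.on_success_callback,
    "on_failure_callback": common.on_failure_callback,
}

@dag(
    dag_id="delta_load_pstn",
    default_args=default_args,
    start_date=pendulum.datetime(2022, 4, 1, tz="UTC"),
    schedule_interval=None,
    catchup=False,
)
def delta_load_pstn_dag():
    @task
    def delta_load_pstn(files):
        # doing something here
        ...

    delta_load_pstn(files=[])

delta_load_pstn_dag()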
Here are the closing logs of the task:
[2022-04-26 11:21:38,494] Marking task as SUCCESS. dag_id=delta_load_pstn, task_id=dq_process, execution_date=20220426T112104, start_date=20220426T112131, end_date=20220426T112138
[2022-04-26 11:21:38,548] 1 downstream tasks scheduled from follow-on schedule check
[2022-04-26 11:21:42,069] State of this instance has been externally set to success. Terminating instance.
[2022-04-26 11:21:42,070] Sending Signals.SIGTERM to GPID 34
[2022-04-26 11:22:42,081] process psutil.Process(pid=34, name='airflow task runner: delta_load_pstn dq_process 2022-04-26T11:21:04.747263+00:00 500', status='sleeping', started='11:21:31') did not respond to SIGTERM. Trying SIGKILL
[2022-04-26 11:22:42,095] Process psutil.Process(pid=34, name='airflow task runner: delta_load_pstn dq_process 2022-04-26T11:21:04.747263+00:00 500', status='terminated', exitcode=<Negsignal.SIGKILL: -9>, started='11:21:31') (34) terminated with exit code Negsignal.SIGKILL
[2022-04-26 11:22:42,095] Job 500 was killed before it finished (likely due to running out of memory)
And I can see in the task instance details that the callbacks are configured.
If I implement on_execute_callback, which is called before the execution of the task (see the sketch below), I do get the alert (in Slack). So my guess is that it's something to do with the pod being killed before the callback is handled.
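For completeness, the on_execute_callback variant that does fire looks roughly like this (a sketch; common.on_execute_callback is assumed to post to Slack in the same way as the other helpers):

@task(
    on_execute_callback=common.on_execute_callback,
    on_success_callback=common.on_success_callback,
    on_failure_callback=common.on_failure_callback,
)
def delta_load_pstn(files):
    # doing something here
    ...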
Is there a way to add a task that runs once all other tasks in the same DAG have run successfully? See below for my current DAG.
For example, my current tasks run in the order below, but I want new_task to run once all of them have finished. If I leave it as written below, new_task won't run:
for endpoint in ENDPOINTS:
    latest_only = (operator...)
    s3 = (operator...)
    etc ....

    latest_only >> s3 >> short_circuit
    short_circuit >> snowflake >> success
    short_circuit >> postgres >> success

    if endpoint.name == "io_lineitems":
        success >> il_io_lineitems_tables
        copy_monthly_billing >> load_io_monthly_billing_to_snowflake
        copy_monthly_billing >> load_io_monthly_billing_to_postgres

new_task
Use trigger rules to trigger based on the success status of upstream tasks; see trigger_rules in the Airflow concepts documentation.
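A minimal sketch of the idea, assuming new_task is (or can be) an operator with the default ALL_SUCCESS trigger rule and that the end of every endpoint branch is wired into it; the names simply mirror the DAG above:

from airflow.operators.dummy import DummyOperator
from airflow.utils.trigger_rule import TriggerRule

# Runs only once every task wired upstream of it has succeeded
# (ALL_SUCCESS is the default trigger rule, spelled out here for clarity).
new_task = DummyOperator(
    dag=dag,
    task_id="new_task",
    trigger_rule=TriggerRule.ALL_SUCCESS,
)

for endpoint in ENDPOINTS:
    # ... build latest_only, s3, short_circuit, snowflake, postgres, success as above ...
    success >> new_task  # fan the end of each branch into new_task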
I am trying to add alerts to my Airflow DAGs. The DAGs have multiple tasks, some up to 15.
I want to execute a bash script (a general script for all DAGs) in case any task at any point fails.
For example, a DAG has tasks T1 to T5, as T1 >> T2 >> T3 >> T4 >> T5.
I want to trigger task A (representing alerts) in case any of these fail.
It would be really helpful if anyone can help me with the task hierarchy.
You have two options IMO: failure callbacks and trigger rules.
Success / Failure Callback
Airflow Task Instances have a concept of what to do in case of failure or success. These are callbacks that will be run in the case of a Task reaching a specific state... here are your options:
...
on_failure_callback=None,
on_success_callback=None,
on_retry_callback=None
...
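A minimal sketch of the callback route, assuming the shared alert script lives at /path/to/alert.sh and that both the script path and the dag_id are placeholders; attaching the callback through default_args makes every task in the DAG inherit it:

import subprocess
from datetime import datetime

from airflow import DAG

def alert_on_failure(context):
    # The context dict carries the failing task instance; hand a few fields to the script.
    ti = context["task_instance"]
    subprocess.run(["/path/to/alert.sh", ti.dag_id, ti.task_id], check=False)

default_args = {
    "on_failure_callback": alert_on_failure,
}

dag = DAG(
    dag_id="alerting_example",
    default_args=default_args,
    start_date=datetime(2022, 1, 1),
    schedule_interval=None,
)

Because the callback is set in default_args, it fires for whichever of T1..T5 fails, without changing the task hierarchy.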
Trigger Rules
Airflow task instances have a concept of what state of their upstream tasks to trigger on, with the default being ALL_SUCCESS. That means your main branch can stay as it is, and you can branch off to A from T1 as:
from airflow.utils.trigger_rule import TriggerRule

T1 >> DummyOperator(
    dag=dag,
    task_id="task_a",
    trigger_rule=TriggerRule.ALL_FAILED
)
Alternatively, you can build your branch and include A as:
from airflow.utils.trigger_rule import TriggerRule

[T1, T2, T3, ...] >> DummyOperator(
    dag=dag,
    task_id="task_a",
    trigger_rule=TriggerRule.ONE_FAILED
)
I have configured a DAG in such a way that if the current instance has failed, the next instance won't run. However, here is the problem.
Problem
Let's say a past instance of the task failed and the current instance is in a waiting state. Once I fix the issue, how do I run the current instance without marking the past run successful? I want to keep the history of when the task (DAG) failed.
DAG
dag = DAG(
    dag_id='test_airflow',
    default_args=args,
    tags=['wealth', 'python', 'ml'],
    schedule_interval='5 13 * * *',
    max_active_runs=1,
)

run_this = BashOperator(
    task_id='run_after_loop',
    bash_command='lll',
    dag=dag,
    depends_on_past=True
)
I guess you could trigger a task execution via the CLI using airflow run.
There are two arguments that may help you:
-i, --ignore_dependencies - Ignore task-specific dependencies, e.g. upstream, depends_on_past, and retry delay dependencies
-I, --ignore_depends_on_past - Ignore depends_on_past dependencies (but respect upstream dependencies)
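For example, with the DAG above this would look roughly like the following (the execution date is a placeholder matching the 13:05 schedule; on Airflow 2.x the equivalent command is airflow tasks run with --ignore-depends-on-past):

airflow run test_airflow run_after_loop 2021-06-01T13:05:00 -I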
I was testing Airflow 1.10, using the following DAG:
dag = DAG(dag_id='something',
          start_date=datetime(2019, 1, 2).replace(tzinfo=pytz.timezone('US/Eastern')),
          schedule_interval='@once',
          ...)
I then have several bash operators:
o1 = BashOperator(bash_command="echo 0", dag=dag, task_id='o1')
o2 = BashOperator(bash_command="echo 0", dag=dag, task_id='o2')
o3 = BashOperator(bash_command="echo 0", dag=dag, task_id='o3')
o1 >> o2 >> o3
Airflow parses and displays the DAG with no problem. However, when I trigger the DAG, only the first task runs and gets marked green. The DAG then just remains in the 'running' state while all other tasks are marked in white, meaning they are not picked up by the scheduler. I then receive an email saying:
Executor reports task instance finished (success) although the task says its queued. Was the task killed externally?
OK... I think I figured it out: the problem is with the timezone-aware start_date. The problem went away once I removed replace(tzinfo=pytz.timezone('US/Eastern')).
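If a timezone-aware start_date is actually needed, a safer route than attaching a pytz tzinfo via replace() is the pendulum pattern from the Airflow time-zone docs; a minimal sketch (dag_id and date are placeholders matching the example above):

import pendulum
from datetime import datetime
from airflow import DAG

# Build the timezone with pendulum instead of pytz's replace().
local_tz = pendulum.timezone('US/Eastern')

dag = DAG(dag_id='something',
          start_date=datetime(2019, 1, 2, tzinfo=local_tz),
          schedule_interval='@once')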