Airflow 2.0.2 - Hourly DAG getting stuck, seeing "Refreshing TaskInstance" repeatedly

I've noticed that some runs of an hourly DAG are being skipped. I checked the log for the DAG run just before the skipping started and found that it had actually been running for 7 hours, which is why the other DAG runs didn't happen. This is very strange, since the DAG usually takes only about 30 minutes to finish.
We're using Airflow version 2.0.2
This is what I saw in the logs:
[2022-05-06 13:26:56,668] {taskinstance.py:595} DEBUG - Refreshing TaskInstance <TaskInstance: dfp_hourly.revequery 2022-05-05T13:00:00+00:00 [running]> from DB
[2022-05-06 13:26:56,806] {taskinstance.py:630} DEBUG - Refreshed TaskInstance <TaskInstance: dfp_hourly.revequery 2022-05-05T13:00:00+00:00 [running]>
[2022-05-06 13:27:01,860] {taskinstance.py:595} DEBUG - Refreshing TaskInstance <TaskInstance: dfp_hourly.revequery 2022-05-05T13:00:00+00:00 [running]> from DB
[2022-05-06 13:27:01,872] {taskinstance.py:630} DEBUG - Refreshed TaskInstance <TaskInstance: dfp_hourly.revequery 2022-05-05T13:00:00+00:00 [running]>
[2022-05-06 13:27:06,960] {taskinstance.py:595} DEBUG - Refreshing TaskInstance <TaskInstance: dfp_hourly.revequery 2022-05-05T13:00:00+00:00 [running]> from DB
[2022-05-06 13:27:07,019] {taskinstance.py:630} DEBUG - Refreshed TaskInstance <TaskInstance: dfp_hourly.revequery 2022-05-05T13:00:00+00:00 [running]>
[2022-05-06 13:27:12,224] {taskinstance.py:595} DEBUG - Refreshing TaskInstance <TaskInstance: dfp_hourly.revequery 2022-05-05T13:00:00+00:00 [running]> from DB
[2022-05-06 13:27:12,314] {taskinstance.py:630} DEBUG - Refreshed TaskInstance <TaskInstance: dfp_hourly.revequery 2022-05-05T13:00:00+00:00 [running]>
[2022-05-06 13:27:17,368] {taskinstance.py:595} DEBUG - Refreshing TaskInstance <TaskInstance: dfp_hourly.revequery 2022-05-05T13:00:00+00:00 [running]> from DB
[2022-05-06 13:27:17,377] {taskinstance.py:630} DEBUG - Refreshed TaskInstance

I think you are running too many tasks in parallel, which causes them to run for hours. This can be fixed by using a Pool. Airflow pools can be used to limit the execution parallelism on arbitrary sets of tasks. The list of pools is managed in the UI (Menu -> Admin -> Pools) by giving each pool a name and assigning it a number of worker slots.
Tasks can then be associated with one of the existing pools by using the pool parameter when creating tasks:
aggregate_db_message_job = BashOperator(
    task_id="aggregate_db_message_job",
    execution_timeout=timedelta(hours=3),
    pool="ep_data_pipeline_db_msg_agg",
    bash_command=aggregate_db_message_job_cmd,
    dag=dag,
)
aggregate_db_message_job.set_upstream(wait_for_empty_queue)
Tasks will be scheduled as usual while the slots fill up. The number of slots occupied by a task can be configured with pool_slots. Once capacity is reached, runnable tasks get queued and their state will show as such in the UI. As slots free up, queued tasks start running based on the Priority Weights of the task and its descendants.
Note that if tasks are not given a pool, they are assigned to the default pool default_pool, which is initialized with 128 slots and can be modified through the UI or CLI (but cannot be removed).
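As a concrete sketch for your case (the pool name, slot count, bash command and timeout below are illustrative assumptions; only the DAG and task names come from your logs), the pool can be created from the CLI and then referenced from the task:

airflow pools set hourly_query_pool 4 "limit concurrent hourly query tasks"

from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="dfp_hourly",                       # name taken from the logs above
    start_date=datetime(2022, 5, 1),
    schedule_interval="@hourly",
    catchup=False,
) as dag:
    revequery = BashOperator(
        task_id="revequery",
        bash_command="echo 'run the hourly query here'",  # hypothetical command
        pool="hourly_query_pool",              # hypothetical pool created above
        pool_slots=1,                          # slots this task occupies in the pool
        execution_timeout=timedelta(hours=1),  # fail instead of silently running for 7 hours
    )

With execution_timeout set as well, a run that hangs is killed and retried instead of silently blocking the following hourly runs.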

Related

Airflow: Indefinitely running HTTP Task with no response

Please help me understand why this HTTP task is running for a long time with no progress.
I'm running the official HTTP example, but it looks like I'm missing something here.
https://github.com/apache/airflow/blob/providers-http/4.1.1/tests/system/providers/http/example_http.py
AIRFLOW_CTX_DAG_EMAIL=airflow@example.com
AIRFLOW_CTX_DAG_OWNER=airflow
AIRFLOW_CTX_DAG_ID=example_http_operator
AIRFLOW_CTX_TASK_ID=http_sensor_check
AIRFLOW_CTX_EXECUTION_DATE=2023-02-17T20:53:45.614721+00:00
AIRFLOW_CTX_TRY_NUMBER=1
AIRFLOW_CTX_DAG_RUN_ID=manual__2023-02-17T20:53:45.614721+00:00
[2023-02-17, 20:53:48 UTC] {__init__.py:117} DEBUG - Preparing lineage inlets and outlets
[2023-02-17, 20:53:48 UTC] {__init__.py:155} DEBUG - inlets: [], outlets: []
[2023-02-17, 20:53:48 UTC] {http.py:122} INFO - Poking:
[2023-02-17, 20:53:48 UTC] {base.py:73} INFO - Using connection ID 'http_default' for task execution.
[2023-02-17, 20:53:48 UTC] {http.py:150} DEBUG - Sending 'GET' to url: https://jsonplaceholder.typicode.com/
[2023-02-17, 20:53:52 UTC] {taskinstance.py:769} DEBUG - Refreshing TaskInstance <TaskInstance: example_http_operator.http_sensor_check manual__2023-02-17T20:53:45.614721+00:00 [running]> from DB
[2023-02-17, 20:53:52 UTC] {base_job.py:240} DEBUG - [heartbeat]
[2023-02-17, 20:53:58 UTC] {taskinstance.py:769} DEBUG - Refreshing TaskInstance <TaskInstance: example_http_operator.http_sensor_check manual__2023-02-17T20:53:45.614721+00:00 [running]> from DB
[2023-02-17, 20:53:58 UTC] {base_job.py:240} DEBUG - [heartbeat]
[2023-02-17, 20:54:03 UTC] {taskinstance.py:769} DEBUG - Refreshing TaskInstance <TaskInstance: example_http_operator.http_sensor_check manual__2023-02-17T20:53:45.614721+00:00 [running]> from DB
[2023-02-17, 20:54:03 UTC] {base_job.py:240} DEBUG - [heartbeat]
[2023-02-17, 20:54:08 UTC] {taskinstance.py:769} DEBUG - Refreshing TaskInstance <TaskInstance: example_http_operator.http_sensor_check manual__2023-02-17T20:53:45.614721+00:00 [running]> from DB
[2023-02-17, 20:54:08 UTC] {base_job.py:240} DEBUG - [heartbeat]
[2023-02-17, 20:54:13 UTC] {taskinstance.py:769} DEBUG - Refreshing TaskInstance <TaskInstance: example_http_operator.http_sensor_check manual__2023-02-17T20:53:45.614721+00:00 [running]> from DB
Surprisingly, I'm able to test this code from the CLI without any issue, but I have trouble running it from the UI.
AIRFLOW_CTX_DAG_EMAIL=airflow@example.com
AIRFLOW_CTX_DAG_OWNER=airflow
AIRFLOW_CTX_DAG_ID=example_http_operator
AIRFLOW_CTX_TASK_ID=http_sensor_check
AIRFLOW_CTX_EXECUTION_DATE=2023-02-17T21:05:22.781965+00:00
AIRFLOW_CTX_TRY_NUMBER=1
AIRFLOW_CTX_DAG_RUN_ID=__airflow_temporary_run_2023-02-17T21:05:22.781968+00:00__
[2023-02-17 16:05:23,328] {__init__.py:117} DEBUG - Preparing lineage inlets and outlets
[2023-02-17 16:05:23,328] {__init__.py:155} DEBUG - inlets: [], outlets: []
[2023-02-17 16:05:23,329] {http.py:122} INFO - Poking:
[2023-02-17 16:05:23,332] {base.py:73} INFO - Using connection ID 'http_default' for task execution.
[2023-02-17 16:05:23,332] {http.py:150} DEBUG - Sending 'GET' to url: https://jsonplaceholder.typicode.com/
[2023-02-17 16:05:23,335] {connectionpool.py:1003} DEBUG - Starting new HTTPS connection (1): jsonplaceholder.typicode.com:443
[2023-02-17 16:05:23,667] {connectionpool.py:456} DEBUG - https://jsonplaceholder.typicode.com:443 "GET / HTTP/1.1" 200 None
[2023-02-17 16:05:23,669] {base.py:228} INFO - Success criteria met. Exiting.
[2023-02-17 16:05:23,669] {__init__.py:75} DEBUG - Lineage called with inlets: [], outlets: []
[2023-02-17 16:05:23,670] {taskinstance.py:1329} DEBUG - Clearing next_method and next_kwargs.
[2023-02-17 16:05:23,670] {taskinstance.py:1318} INFO - Marking task as SUCCESS. dag_id=example_http_operator, task_id=http_sensor_check, execution_date=20230217T210522, start_date=, end_date=20230217T210523
[2023-02-17 16:05:23,670] {taskinstance.py:2241} DEBUG - Task Duration set to None
[2023-02-17 16:05:23,696] {cli_action_loggers.py:83} DEBUG - Calling callbacks: []
[2023-02-17 16:05:23,696] {settings.py:407} DEBUG - Disposing DB connection pool (PID 65429)

Why are my Airflow tasks being "externally set to failed"?

I'm using Airflow 2.0.0, and my tasks are sporadically being killed "externally" after running for a few seconds or minutes. The tasks usually run successfully (both for manual tasks initiated via airflow tasks test ... and for scheduled DAG runs), so I believe this is not related to my DAG code.
When tasks fail, this seems to be the key error from the task logs:
{local_task_job.py:170} WARNING - State of this instance has been externally set to failed. Terminating instance.
[2020-12-20 11:26:11,448] {taskinstance.py:826} INFO - Dependencies all met for <TaskInstance: daily_backups.run_backupper 2020-12-19T02:00:00+00:00 [queued]>
[2020-12-20 11:26:11,473] {taskinstance.py:826} INFO - Dependencies all met for <TaskInstance: daily_backups.run_backupper 2020-12-19T02:00:00+00:00 [queued]>
[2020-12-20 11:26:11,473] {taskinstance.py:1017} INFO -
--------------------------------------------------------------------------------
[2020-12-20 11:26:11,473] {taskinstance.py:1018} INFO - Starting attempt 3 of 3
[2020-12-20 11:26:11,473] {taskinstance.py:1019} INFO -
--------------------------------------------------------------------------------
[2020-12-20 11:26:11,506] {taskinstance.py:1038} INFO - Executing <Task(PythonOperator): run_backupper> on 2020-12-19T02:00:00+00:00
[2020-12-20 11:26:11,509] {standard_task_runner.py:51} INFO - Started process 12059 to run task
[2020-12-20 11:26:11,515] {standard_task_runner.py:75} INFO - Running: ['airflow', 'tasks', 'run', 'daily_backups', 'run_backupper', '2020-12-19T02:00:00+00:00', '--job-id', '22', '--pool', 'default_pool', '--raw', '--subdir', 'DAGS_FOLDER/backupper/daily_backups.py', '--cfg-path', '/tmp/tmpnfmqtorg']
[2020-12-20 11:26:11,517] {standard_task_runner.py:76} INFO - Job 22: Subtask run_backupper
[2020-12-20 11:26:11,609] {logging_mixin.py:103} INFO - Running <TaskInstance: daily_backups.run_backupper 2020-12-19T02:00:00+00:00 [running]> on host localhost
[2020-12-20 11:26:11,742] {taskinstance.py:1232} INFO - Exporting the following env vars:
AIRFLOW_CTX_DAG_OWNER=<user>
AIRFLOW_CTX_DAG_ID=daily_backups
AIRFLOW_CTX_TASK_ID=run_backupper
AIRFLOW_CTX_EXECUTION_DATE=2020-12-19T02:00:00+00:00
AIRFLOW_CTX_DAG_RUN_ID=scheduled__2020-12-19T02:00:00+00:00
...
... my job's logs, indicating that the job is running healthily ...
...
[2020-12-20 11:26:16,587] {local_task_job.py:170} WARNING - State of this instance has been externally set to failed. Terminating instance.
[2020-12-20 11:26:16,593] {process_utils.py:95} INFO - Sending Signals.SIGTERM to GPID 12059
[2020-12-20 11:27:16,609] {process_utils.py:108} WARNING - process psutil.Process(pid=12059, name='airflow task runner: daily_backups run_backupper 2020-12-19T02:00:00+00:00 22', status='sleeping', started='11:26:11') did not respond to SIGTERM. Trying SIGKILL
[2020-12-20 11:27:16,618] {process_utils.py:61} INFO - Process psutil.Process(pid=12059, name='airflow task runner: daily_backups run_backupper 2020-12-19T02:00:00+00:00 22', status='terminated', exitcode=<Negsignal.SIGKILL: -9>, started='11:26:11') (12059) terminated with exit code Negsignal.SIGKILL
[2020-12-20 11:27:16,618] {local_task_job.py:118} INFO - Task exited with return code Negsignal.SIGKILL
The final few lines in the logs are not consistent. Here is a different version, for the same task that failed in an earlier attempt:
... same stuff as before ...
[2020-12-20 02:01:12,689] {local_task_job.py:170} WARNING - State of this instance has been externally set to failed. Terminating instance.
[2020-12-20 02:01:12,695] {process_utils.py:95} INFO - Sending Signals.SIGTERM to GPID 24442
[2020-12-20 02:02:00,462] {taskinstance.py:1214} ERROR - Received SIGTERM. Terminating subprocesses.
[2020-12-20 02:02:00,498] {process_utils.py:61} INFO - Process psutil.Process(pid=24442, status='terminated', exitcode=0, started='02:00:10') (24442) terminated with exit code 0
[2020-12-20 02:02:00,499] {local_task_job.py:118} INFO - Task exited with return code 0
I suspect in this case the script was able to respond to the SIGTERM in time, whereas in the previous case it was blocked on a long-running query and was not able to terminate cleanly.
I believe the problem was that the scheduler health check threshold was set to be smaller than the scheduler heartbeat interval.
In my config I had set scheduler_health_check_threshold to 30 seconds and scheduler_heartbeat_sec to 60 seconds. During the check for orphaned tasks (itself governed by a different parameter, orphaned_tasks_check_interval), the scheduler heartbeat was determined to be older than 30 seconds, which makes sense, because it was only heartbeating every 60 seconds. Thus the scheduler was inferred to be unhealthy and was therefore terminated.
Around the time of the failure, I could see messages like these in /var/log/syslog
Dec 20 11:26:14 localhost bash[11545]: [2020-12-20 11:26:14,368] {scheduler_job.py:1751} INFO - Resetting orphaned tasks for active dag runs
Dec 20 11:26:14 localhost bash[11545]: [2020-12-20 11:26:14,373] {scheduler_job.py:1764} INFO - Marked 1 SchedulerJob instances as failed
Dec 20 11:26:14 localhost bash[11545]: [2020-12-20 11:26:14,381] {scheduler_job.py:1805} INFO - Reset the following 1 orphaned TaskInstances:
Dec 20 11:26:14 localhost bash[11545]: #011<TaskInstance: daily_backups.run_backupper 2020-12-19 02:00:00+00:00 [running]>
Dec 20 11:26:14 localhost bash[11545]: [2020-12-20 11:26:14,571] {scheduler_job.py:938} INFO - 1 tasks up for execution:
Dec 20 11:26:14 localhost bash[11545]: #011<TaskInstance: daily_backups.run_backupper 2020-12-19 02:00:00+00:00 [scheduled]>
Dec 20 11:26:14 localhost bash[11545]: [2020-12-20 11:26:14,574] {scheduler_job.py:972} INFO - Figuring out tasks to run in Pool(name=default_pool) with 128 open slots and 1 task instances ready to be queued
Dec 20 11:26:14 localhost bash[11545]: [2020-12-20 11:26:14,575] {scheduler_job.py:999} INFO - DAG daily_backups has 0/16 running and queued tasks
Dec 20 11:26:14 localhost bash[11545]: [2020-12-20 11:26:14,575] {scheduler_job.py:1060} INFO - Setting the following tasks to queued state:
Dec 20 11:26:14 localhost bash[11545]: #011<TaskInstance: daily_backups.run_backupper 2020-12-19 02:00:00+00:00 [scheduled]>
Dec 20 11:26:14 localhost bash[11545]: [2020-12-20 11:26:14,578] {scheduler_job.py:1102} INFO - Sending TaskInstanceKey(dag_id='daily_backups', task_id='run_backupper', execution_date=datetime.datetime(2020, 12, 19, 2, 0, tzinfo=Timezone('UTC')), try_number=4) to executor with priority 2 and queue default
Dec 20 11:26:14 localhost bash[11545]: [2020-12-20 11:26:14,578] {base_executor.py:79} INFO - Adding to queue: ['airflow', 'tasks', 'run', 'daily_backups', 'run_backupper', '2020-12-19T02:00:00+00:00', '--local', '--pool', 'default_pool', '--subdir', '/storage/airflow/dags/backupper/daily_backups.py']
Dec 20 11:26:14 localhost bash[11545]: [2020-12-20 11:26:14,581] {local_executor.py:81} INFO - QueuedLocalWorker running ['airflow', 'tasks', 'run', 'daily_backups', 'run_backupper', '2020-12-19T02:00:00+00:00', '--local', '--pool', 'default_pool', '--subdir', '/storage/airflow/dags/backupper/daily_backups.py']
Dec 20 11:26:14 localhost bash[11545]: [2020-12-20 11:26:14,707] {dagbag.py:440} INFO - Filling up the DagBag from /storage/airflow/dags/backupper/daily_backups.py
Dec 20 11:26:15 localhost bash[11545]: Running <TaskInstance: daily_backups.run_backupper 2020-12-19T02:00:00+00:00 [queued]> on host localhost
and the timestamps coincide closely with the SIGTERM received by my task. I guess that since the SchedulerJob was marked as failed, then the TaskInstance running my actual task was considered an orphan, and thus marked for termination. At the same time it scheduled a new attempt (try_number=4).
Increasing the scheduler_health_check_threshold to 120 seconds and restarting the scheduler/webserver services appears to have resolved my issue.
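For reference, a minimal sketch of the relevant [scheduler] settings after that change (only the two values above come from my setup; the comments and the orphaned-task interval, shown at its default, are for illustration):

[scheduler]
# How often (in seconds) the scheduler emits a heartbeat
scheduler_heartbeat_sec = 60
# A scheduler whose last heartbeat is older than this is considered unhealthy;
# keep it comfortably larger than scheduler_heartbeat_sec
scheduler_health_check_threshold = 120
# How often to check for orphaned tasks / failed SchedulerJob instances (default)
orphaned_tasks_check_interval = 300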
I had the same issue.
From logs:
2021-05-07 13:04:19,960 INFO - Resetting orphaned tasks for active dag runs
2021-05-07 13:09:20,060 INFO - Resetting orphaned tasks for active dag runs
2021-05-07 13:14:20,186 INFO - Resetting orphaned tasks for active dag runs
2021-05-07 13:19:20,263 INFO - Resetting orphaned tasks for active dag runs
2021-05-07 13:24:20,399 INFO - Resetting orphaned tasks for active dag runs
2021-05-07 13:29:20,729 INFO - Resetting orphaned tasks for active dag runs
2021-05-07 13:34:20,892 INFO - Resetting orphaned tasks for active dag runs
2021-05-07 13:39:21,070 INFO - Resetting orphaned tasks for active dag runs
2021-05-07 13:44:21,328 INFO - Resetting orphaned tasks for active dag runs
2021-05-07 13:49:21,423 INFO - Resetting orphaned tasks for active dag runs
And my scheduler config was as follows:
# Task instances listen for external kill signal (when you clear tasks
# from the CLI or the UI), this defines the frequency at which they should
# listen (in seconds).
job_heartbeat_sec = 5
I had the same issue in AKS (Azure Kubernetes).
I resolved it by setting AIRFLOW__SCHEDULER__SCHEDULE_AFTER_TASK_EXECUTION to False.
https://github.com/apache/airflow/issues/14672
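A sketch of how that setting can be applied (the deployment-specific wiring is an assumption; in Kubernetes the variable would go into the pod spec or Helm values rather than a shell export):

# as an environment variable on the scheduler
export AIRFLOW__SCHEDULER__SCHEDULE_AFTER_TASK_EXECUTION=False

# or the equivalent airflow.cfg entry
[scheduler]
schedule_after_task_execution = False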
I had the same problem running on a MacBook. It seems to be the MacBook going to sleep; I solved it by ticking "Prevent your Mac from automatically sleeping when the display is off" in the "Power Adapter" section of Preferences -> Battery (when on a charger).

Airflow Execution Timeout not working well

I've set the 'execution_timeout': timedelta(seconds=300) parameter on many tasks. When the execution timeout is set on a task downloading data from Google Analytics, it works properly: after ~300 seconds the task is set to failed. That task downloads some data from an API (Python), then does some transformations (Python) and loads the data into PostgreSQL.
Then I have a task which executes only one PostgreSQL function. Its execution sometimes takes more than 300 seconds, but I get this (the task is marked as finished successfully):
*** Reading local file: /home/airflow/airflow/logs/bulk_replication_p2p_realtime/t1/2020-07-20T00:05:00+00:00/1.log
[2020-07-20 05:05:35,040] {__init__.py:1139} INFO - Dependencies all met for <TaskInstance: bulk_replication_p2p_realtime.t1 2020-07-20T00:05:00+00:00 [queued]>
[2020-07-20 05:05:35,051] {__init__.py:1139} INFO - Dependencies all met for <TaskInstance: bulk_replication_p2p_realtime.t1 2020-07-20T00:05:00+00:00 [queued]>
[2020-07-20 05:05:35,051] {__init__.py:1353} INFO -
--------------------------------------------------------------------------------
[2020-07-20 05:05:35,051] {__init__.py:1354} INFO - Starting attempt 1 of 1
[2020-07-20 05:05:35,051] {__init__.py:1355} INFO -
--------------------------------------------------------------------------------
[2020-07-20 05:05:35,098] {__init__.py:1374} INFO - Executing <Task(PostgresOperator): t1> on 2020-07-20T00:05:00+00:00
[2020-07-20 05:05:35,099] {base_task_runner.py:119} INFO - Running: ['airflow', 'run', 'bulk_replication_p2p_realtime', 't1', '2020-07-20T00:05:00+00:00', '--job_id', '958216', '--raw', '-sd', 'DAGS_FOLDER/bulk_replication_p2p_realtime.py', '--cfg_path', '/tmp/tmph11tn6fe']
[2020-07-20 05:05:37,348] {base_task_runner.py:101} INFO - Job 958216: Subtask t1 [2020-07-20 05:05:37,347] {settings.py:182} INFO - settings.configure_orm(): Using pool settings. pool_size=10, pool_recycle=1800, pid=26244
[2020-07-20 05:05:39,503] {base_task_runner.py:101} INFO - Job 958216: Subtask t1 [2020-07-20 05:05:39,501] {__init__.py:51} INFO - Using executor LocalExecutor
[2020-07-20 05:05:39,857] {base_task_runner.py:101} INFO - Job 958216: Subtask t1 [2020-07-20 05:05:39,856] {__init__.py:305} INFO - Filling up the DagBag from /home/airflow/airflow/dags/bulk_replication_p2p_realtime.py
[2020-07-20 05:05:39,894] {base_task_runner.py:101} INFO - Job 958216: Subtask t1 [2020-07-20 05:05:39,894] {cli.py:517} INFO - Running <TaskInstance: bulk_replication_p2p_realtime.t1 2020-07-20T00:05:00+00:00 [running]> on host dwh2-airflow-dev
[2020-07-20 05:05:39,938] {postgres_operator.py:62} INFO - Executing: CALL dw_system.bulk_replicate(p_graph_name=>'replication_p2p_realtime',p_group_size=>4 , p_group=>1, p_dag_id=>'bulk_replication_p2p_realtime', p_task_id=>'t1')
[2020-07-20 05:05:39,960] {logging_mixin.py:95} INFO - [2020-07-20 05:05:39,953] {base_hook.py:83} INFO - Using connection to: id: postgres_warehouse. Host: XXX Port: 5432, Schema: XXXX Login: XXX Password: XXXXXXXX, extra: {}
[2020-07-20 05:05:39,973] {logging_mixin.py:95} INFO - [2020-07-20 05:05:39,972] {dbapi_hook.py:171} INFO - CALL dw_system.bulk_replicate(p_graph_name=>'replication_p2p_realtime',p_group_size=>4 , p_group=>1, p_dag_id=>'bulk_replication_p2p_realtime', p_task_id=>'t1')
[2020-07-20 05:23:21,450] {logging_mixin.py:95} INFO - [2020-07-20 05:23:21,449] {timeout.py:42} ERROR - Process timed out, PID: 26244
[2020-07-20 05:23:36,453] {logging_mixin.py:95} INFO - [2020-07-20 05:23:36,452] {jobs.py:2562} INFO - Task exited with return code 0
Does anyone know how to enforce the execution timeout for such long-running functions? It seems that the execution timeout is only evaluated once the PG function finishes.
Airflow uses the signal module from the standard library to effect the timeout. It hooks into these system signals and requests that the calling process be notified in N seconds; if the process is still inside the timeout context (see the __enter__ and __exit__ methods on the class) when the signal arrives, an AirflowTaskTimeout exception is raised.
Unfortunately for this situation, there are certain classes of system operations that cannot be interrupted. This is actually called out in the signal documentation:
A long-running calculation implemented purely in C (such as regular expression matching on a large body of text) may run uninterrupted for an arbitrary amount of time, regardless of any signals received. The Python signal handlers will be called when the calculation finishes.
To which we say "But I'm not doing a long-running calculation in C!" -- in Airflow this is almost always due to uninterruptible I/O operations instead.
The quoted documentation nicely explains why the handler is still triggered, but only after the task is allowed to (frustratingly!) finish, well beyond your requested timeout.
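A minimal sketch of that signal-based mechanism (an illustration of the approach, not Airflow's actual timeout class) makes the limitation visible:

import signal
import time


class TaskTimeout(Exception):
    """Stand-in for Airflow's AirflowTaskTimeout."""


class timeout:
    """Raise TaskTimeout if the `with` block runs longer than `seconds`."""

    def __init__(self, seconds):
        self.seconds = seconds

    def handle_timeout(self, signum, frame):
        # The handler can only run once the Python interpreter regains control.
        raise TaskTimeout(f"Timed out after {self.seconds}s")

    def __enter__(self):
        signal.signal(signal.SIGALRM, self.handle_timeout)
        signal.alarm(self.seconds)

    def __exit__(self, exc_type, exc_value, traceback):
        signal.alarm(0)  # cancel the alarm if the block finished in time


with timeout(5):
    # A pure-Python sleep is interrupted as expected: TaskTimeout is raised after
    # 5 seconds. Swap in a call that blocks inside a C extension (such as the
    # database driver call issued by PostgresOperator) and the handler cannot run
    # until that call returns -- which is exactly the delayed "Process timed out"
    # message in the log above.
    time.sleep(60)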

Airflow dag_id did not exist or it failed to parse

Currently I'm learning how to use Apache Airflow and am trying to create a simple DAG script like this:
from datetime import datetime
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import PythonOperator


def print_hello():
    return 'Hello world!'


dag = DAG('hello_world', description='Simple tutorial DAG',
          schedule_interval='0 0 * * *',
          start_date=datetime(2020, 5, 23), catchup=False)

dummy_operator = DummyOperator(task_id='dummy_task', retries=3, dag=dag)
hello_operator = PythonOperator(task_id='hello_task', python_callable=print_hello, dag=dag)

dummy_operator >> hello_operator
I ran this DAG using the web server and it ran successfully; I even checked the logs:
[2020-05-23 20:43:53,411] {taskinstance.py:669} INFO - Dependencies all met for <TaskInstance: hello_world.hello_task 2020-05-23T13:42:17.463955+00:00 [queued]>
[2020-05-23 20:43:53,431] {taskinstance.py:669} INFO - Dependencies all met for <TaskInstance: hello_world.hello_task 2020-05-23T13:42:17.463955+00:00 [queued]>
[2020-05-23 20:43:53,432] {taskinstance.py:879} INFO -
--------------------------------------------------------------------------------
[2020-05-23 20:43:53,432] {taskinstance.py:880} INFO - Starting attempt 1 of 1
[2020-05-23 20:43:53,432] {taskinstance.py:881} INFO -
--------------------------------------------------------------------------------
[2020-05-23 20:43:53,448] {taskinstance.py:900} INFO - Executing <Task(PythonOperator): hello_task> on 2020-05-23T13:42:17.463955+00:00
[2020-05-23 20:43:53,477] {standard_task_runner.py:53} INFO - Started process 7442 to run task
[2020-05-23 20:43:53,685] {logging_mixin.py:112} INFO - Running %s on host %s <TaskInstance: hello_world.hello_task 2020-05-23T13:42:17.463955+00:00 [running]> LAPTOP-9BCTKM5O.localdomain
[2020-05-23 20:43:53,715] {python_operator.py:114} INFO - Done. Returned value was: Hello world!
[2020-05-23 20:43:53,738] {taskinstance.py:1052} INFO - Marking task as SUCCESS.dag_id=hello_world, task_id=hello_task, execution_date=20200523T134217, start_date=20200523T134353, end_date=20200523T134353
[2020-05-23 20:44:03,372] {logging_mixin.py:112} INFO - [2020-05-23 20:44:03,372] {local_task_job.py:103} INFO - Task exited with return code 0
but when I tried to test-run a single task using this command
airflow test dags/main.py hello_task 2020-05-23
it shows this error
airflow.exceptions.AirflowException: dag_id could not be found: dags/main.py. Either the dag did not exist or it failed to parse.
Where did I go wrong?
You got your airflow test command a tad wrong: instead of giving the path to the DAG file, dags/main.py, you need to pass the dag_id itself, which is hello_world looking at your code.
So try this:
airflow test hello_world hello_task 2020-05-23
You should get output similar to this :)
airflow#940836ce7da4:/opt/airflow$ airflow test hello_world hello_task 2020-05-23
[2020-05-23 14:18:51,144] {__init__.py:51} INFO - Using executor CeleryExecutor
[2020-05-23 14:18:51,145] {dagbag.py:396} INFO - Filling up the DagBag from /opt/airflow/dags
[2020-05-23 14:18:51,190] {taskinstance.py:669} INFO - Dependencies all met for <TaskInstance: hello_world.hello_task 2020-05-23T00:00:00+00:00 [None]>
[2020-05-23 14:18:51,203] {taskinstance.py:669} INFO - Dependencies all met for <TaskInstance: hello_world.hello_task 2020-05-23T00:00:00+00:00 [None]>
[2020-05-23 14:18:51,203] {taskinstance.py:879} INFO -
--------------------------------------------------------------------------------
[2020-05-23 14:18:51,203] {taskinstance.py:880} INFO - Starting attempt 1 of 1
[2020-05-23 14:18:51,203] {taskinstance.py:881} INFO -
--------------------------------------------------------------------------------
[2020-05-23 14:18:51,204] {taskinstance.py:900} INFO - Executing <Task(PythonOperator): hello_task> on 2020-05-23T00:00:00+00:00
[2020-05-23 14:18:51,234] {python_operator.py:114} INFO - Done. Returned value was: Hello world!
[2020-05-23 14:18:51,249] {taskinstance.py:1065} INFO - Marking task as SUCCESS.dag_id=hello_world, task_id=hello_task, execution_date=20200523T000000, start_date=20200523T141851, end_date=20200523T141851
After Airflow 2.0, the command is:
airflow tasks test <dag_id> <task_id> <date>
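For the DAG in this question that would be, for example:
airflow tasks test hello_world hello_task 2020-05-23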

DummyOperator marked upstream_failed yet all upstream tasks marked success

I have an Airflow pipeline that produces 12 staging tables from Google Cloud Storage files and then performs some downstream processing. I have a DummyOperator to collect these tasks before proceeding to the next stages.
I'm getting an error on the wait_stg_load operator saying it's in an upstream_failed state. However all of the upstream tasks are marked as success. The DAG itself is now marked as failed. If I clear the status on wait_stg_load, everything proceeds fine. Any ideas on what I'm doing wrong?
I am using Google Cloud Composer, which runs Airflow v1.9 on Python 3.
with DAG('load_data',
         default_args=default_args,
         schedule_interval='0 9 * * *',
         concurrency=3
         ) as dag:

    t2 = DummyOperator(
        task_id='wait_stg_load',
        dag=dag
    )

    for t in tables:
        t1 = GoogleCloudStorageToBigQueryOperator(
            task_id='load_stg_{}'.format(t.replace('.', '_')),
            bucket='my-bucket',
            source_objects=['data/{}.json'.format(t)],
            destination_project_dataset_table='{}.stg_{}'.format(DATASET_NAME, t.replace('.', '_')),
            schema_object='data/schemas/{}.json'.format(t),
            source_format='NEWLINE_DELIMITED_JSON',
            write_disposition='WRITE_TRUNCATE',
            dag=dag
        )
        t1 >> t2
Update 1
I believe this is a concurrency issue within Airflow. I noticed that the task does indeed fail at some point, but later runs anyway. It gets marked complete, yet the DummyOperator doesn't see that.
[2019-02-14 09:00:14,734] {cli.py:374} INFO - Running on host airflow-worker
[2019-02-14 09:00:16,686] {models.py:1196} INFO - Dependencies all met for <TaskInstance: dag.task 2019-02-13 09:00:00 [queued]>
[2019-02-14 09:00:16,694] {models.py:1189} INFO - Dependencies not met for <TaskInstance: dag.task 2019-02-13 09:00:00 [queued]>, dependency 'Task Instance Slots Available' FAILED: The maximum number of running tasks (3) for this task's DAG 'dag' has been reached.
[2019-02-14 09:00:16,694] {models.py:1389} WARNING -
-------------------------------------------------------------------------------
FIXME: Rescheduling due to concurrency limits reached at task runtime. Attempt 1 of 1. State set to NONE
-------------------------------------------------------------------------------
[2019-02-14 09:00:16,694] {models.py:1392} INFO - Queuing into pool None
[2019-02-14 09:00:26,619] {cli.py:374} INFO - Running on host airflow-worker
[2019-02-14 09:00:28,563] {models.py:1196} INFO - Dependencies all met for <TaskInstance: dag.task 2019-02-13 09:00:00 [failed]>
[2019-02-14 09:00:28,570] {models.py:1196} INFO - Dependencies all met for <TaskInstance: dag.task 2019-02-13 09:00:00 [failed]>
[2019-02-14 09:00:28,570] {models.py:1406} INFO -
-------------------------------------------------------------------------------
Starting attempt 1 of
-------------------------------------------------------------------------------
[2019-02-14 09:00:28,607] {models.py:1427} INFO - Executing <Task(GoogleCloudStorageToBigQueryOperator): task> on 2019-02-13 09:00:00
