Why does the airflow scheduler keep running my DAG file?

I followed the tutorial: I created a folder $AIRFLOW_HOME/dags, put the tutorial DAG Python file there, and then started the airflow scheduler. The DAG is paused by default, yet when I look at the output of airflow scheduler I see lots of runs trying to create the DAGs. Why does it keep running?
[2018-09-10 15:49:24,123] {jobs.py:1108} INFO - No tasks to consider for execution.
[2018-09-10 15:49:24,125] {jobs.py:1538} INFO -
================================================================================
DAG File Processing Stats
File Path PID Runtime Last Runtime Last Run
------------------------------------------------------------ ----- --------- -------------- -------------------
/Users/xiang/Documents/BigData/airflow/dags/my_tutorial_2.py 29257 0.44s 0.43s 2018-09-10T13:49:22
================================================================================
[2018-09-10 15:49:24,125] {dag_processing.py:495} INFO - Processor for /Users/xiang/Documents/BigData/airflow/dags/my_tutorial_2.py finished
[2018-09-10 15:49:25,133] {dag_processing.py:582} INFO - Started a process (PID: 29258) to generate tasks for /Users/xiang/Documents/BigData/airflow/dags/my_tutorial_2.py
[2018-09-10 15:49:25,560] {jobs.py:1108} INFO - No tasks to consider for execution.
[2018-09-10 15:49:25,561] {dag_processing.py:495} INFO - Processor for /Users/xiang/Documents/BigData/airflow/dags/my_tutorial_2.py finished
[2018-09-10 15:49:26,567] {dag_processing.py:582} INFO - Started a process (PID: 29259) to generate tasks for /Users/xiang/Documents/BigData/airflow/dags/my_tutorial_2.py
[2018-09-10 15:49:26,993] {jobs.py:1108} INFO - No tasks to consider for execution.
[2018-09-10 15:49:27,001] {dag_processing.py:495} INFO - Processor for /Users/xiang/Documents/BigData/airflow/dags/my_tutorial_2.py finished
[2018-09-10 15:49:28,009] {dag_processing.py:582} INFO - Started a process (PID: 29260) to generate tasks for /Users/xiang/Documents/BigData/airflow/dags/my_tutorial_2.py
[2018-09-10 15:49:28,439] {jobs.py:1108} INFO - No tasks to consider for execution.
[2018-09-10 15:49:28,440] {dag_processing.py:495} INFO - Processor for /Users/xiang/Documents/BigData/airflow/dags/my_tutorial_2.py finished
[2018-09-10 15:49:29,445] {dag_processing.py:582} INFO - Started a process (PID: 29261) to generate tasks for /Users/xiang/Documents/BigData/airflow/dags/my_tutorial_2.py
[2018-09-10 15:49:29,872] {jobs.py:1108} INFO - No tasks to consider for execution.
[2018-09-10 15:49:29,873] {dag_processing.py:495} INFO - Processor for /Users/xiang/Documents/BigData/airflow/dags/my_tutorial_2.py finished
[2018-09-10 15:49:30,876] {dag_processing.py:582} INFO - Started a process (PID: 29263) to generate tasks for /Users/xiang/Documents/BigData/airflow/dags/my_tutorial_2.py
[2018-09-10 15:49:31,309] {jobs.py:1108} INFO - No tasks to consider for execution.

The scheduler will "heartbeat" your DAG files based on the contents of your airflow.cfg. The two settings that are probably most relevant to this are:
min_file_process_interval: how many seconds to wait between file-parsing loops, which also keeps the logs from being spammed.
scheduler_heartbeat_sec: the scheduler constantly tries to trigger new tasks (see the scheduler section in the docs for more information); this defines how often the scheduler loop should run, in seconds.
Consider raising these if you are only running a few DAGs with tasks that are not run very often; see the sketch below.
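For example, in the [scheduler] section of airflow.cfg you could raise both intervals so the file is not re-parsed every second or so (the values below are only illustrative; check the defaults that ship with your Airflow version):

[scheduler]
# parse each DAG file at most once per minute instead of near-continuously
min_file_process_interval = 60
# run the scheduler's main scheduling loop every 30 seconds
scheduler_heartbeat_sec = 30

After editing airflow.cfg, restart the scheduler so the new values take effect.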

Related

Airflow task callbacks are missed sometimes

Airflow: 2.1.2
Executor: KubernetesExecutor
Python: 3.7
I have written tasks using the Airflow 2+ TaskFlow API and I am running Airflow with the KubernetesExecutor. There are success and failure callbacks on the task, but sometimes they get missed.
I've tried specifying the callbacks both via default_args on the DAG and directly in the task decorator, but I see the same behaviour.
@task(
    on_success_callback=common.on_success_callback,
    on_failure_callback=common.on_failure_callback,
)
def delta_load_pstn(files):
    # doing something here
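(The default_args variant I tried looks roughly like the sketch below; the dag_id, dates and schedule are illustrative, and common.* are the same helper callbacks as above.)

from datetime import datetime

from airflow import DAG
from airflow.decorators import task

import common  # module that holds the callback helpers

default_args = {
    # picked up by every task in the DAG
    "on_success_callback": common.on_success_callback,
    "on_failure_callback": common.on_failure_callback,
}

with DAG(
    dag_id="delta_load_pstn",          # illustrative
    start_date=datetime(2022, 4, 1),   # illustrative
    schedule_interval=None,            # illustrative
    default_args=default_args,
) as dag:

    @task
    def delta_load_pstn(files):
        # doing something here
        ...

    delta_load_pstn(files=[])  # illustrative invocation so the task is registered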
Here are the closing logs of the task:
[2022-04-26 11:21:38,494] Marking task as SUCCESS. dag_id=delta_load_pstn, task_id=dq_process, execution_date=20220426T112104, start_date=20220426T112131, end_date=20220426T112138
[2022-04-26 11:21:38,548] 1 downstream tasks scheduled from follow-on schedule check
[2022-04-26 11:21:42,069] State of this instance has been externally set to success. Terminating instance.
[2022-04-26 11:21:42,070] Sending Signals.SIGTERM to GPID 34
[2022-04-26 11:22:42,081] process psutil.Process(pid=34, name='airflow task runner: delta_load_pstn dq_process 2022-04-26T11:21:04.747263+00:00 500', status='sleeping', started='11:21:31') did not respond to SIGTERM. Trying SIGKILL
[2022-04-26 11:22:42,095] Process psutil.Process(pid=34, name='airflow task runner: delta_load_pstn dq_process 2022-04-26T11:21:04.747263+00:00 500', status='terminated', exitcode=<Negsignal.SIGKILL: -9>, started='11:21:31') (34) terminated with exit code Negsignal.SIGKILL
[2022-04-26 11:22:42,095] Job 500 was killed before it finished (likely due to running out of memory)
And I can see in the task instance details that the callbacks are configured.
If I implement on_execute_callback, which is called before the execution of the task, I do get the alert (in Slack). So my guess is that the pod is being killed before the callback is handled.

Airflow Execution Timeout not working well

I've set the 'execution_timeout': timedelta(seconds=300) parameter on many tasks. When the execution timeout is set on a task downloading data from Google Analytics it works properly: after ~300 seconds the task is set to failed. That task downloads some data from an API (Python), does some transformations (Python) and loads the data into PostgreSQL.
Then I have a task which executes only one PostgreSQL function. Its execution sometimes takes more than 300 seconds, but instead of failing I get this (the task is marked as finished successfully):
*** Reading local file: /home/airflow/airflow/logs/bulk_replication_p2p_realtime/t1/2020-07-20T00:05:00+00:00/1.log
[2020-07-20 05:05:35,040] {__init__.py:1139} INFO - Dependencies all met for <TaskInstance: bulk_replication_p2p_realtime.t1 2020-07-20T00:05:00+00:00 [queued]>
[2020-07-20 05:05:35,051] {__init__.py:1139} INFO - Dependencies all met for <TaskInstance: bulk_replication_p2p_realtime.t1 2020-07-20T00:05:00+00:00 [queued]>
[2020-07-20 05:05:35,051] {__init__.py:1353} INFO -
--------------------------------------------------------------------------------
[2020-07-20 05:05:35,051] {__init__.py:1354} INFO - Starting attempt 1 of 1
[2020-07-20 05:05:35,051] {__init__.py:1355} INFO -
--------------------------------------------------------------------------------
[2020-07-20 05:05:35,098] {__init__.py:1374} INFO - Executing <Task(PostgresOperator): t1> on 2020-07-20T00:05:00+00:00
[2020-07-20 05:05:35,099] {base_task_runner.py:119} INFO - Running: ['airflow', 'run', 'bulk_replication_p2p_realtime', 't1', '2020-07-20T00:05:00+00:00', '--job_id', '958216', '--raw', '-sd', 'DAGS_FOLDER/bulk_replication_p2p_realtime.py', '--cfg_path', '/tmp/tmph11tn6fe']
[2020-07-20 05:05:37,348] {base_task_runner.py:101} INFO - Job 958216: Subtask t1 [2020-07-20 05:05:37,347] {settings.py:182} INFO - settings.configure_orm(): Using pool settings. pool_size=10, pool_recycle=1800, pid=26244
[2020-07-20 05:05:39,503] {base_task_runner.py:101} INFO - Job 958216: Subtask t1 [2020-07-20 05:05:39,501] {__init__.py:51} INFO - Using executor LocalExecutor
[2020-07-20 05:05:39,857] {base_task_runner.py:101} INFO - Job 958216: Subtask t1 [2020-07-20 05:05:39,856] {__init__.py:305} INFO - Filling up the DagBag from /home/airflow/airflow/dags/bulk_replication_p2p_realtime.py
[2020-07-20 05:05:39,894] {base_task_runner.py:101} INFO - Job 958216: Subtask t1 [2020-07-20 05:05:39,894] {cli.py:517} INFO - Running <TaskInstance: bulk_replication_p2p_realtime.t1 2020-07-20T00:05:00+00:00 [running]> on host dwh2-airflow-dev
[2020-07-20 05:05:39,938] {postgres_operator.py:62} INFO - Executing: CALL dw_system.bulk_replicate(p_graph_name=>'replication_p2p_realtime',p_group_size=>4 , p_group=>1, p_dag_id=>'bulk_replication_p2p_realtime', p_task_id=>'t1')
[2020-07-20 05:05:39,960] {logging_mixin.py:95} INFO - [2020-07-20 05:05:39,953] {base_hook.py:83} INFO - Using connection to: id: postgres_warehouse. Host: XXX Port: 5432, Schema: XXXX Login: XXX Password: XXXXXXXX, extra: {}
[2020-07-20 05:05:39,973] {logging_mixin.py:95} INFO - [2020-07-20 05:05:39,972] {dbapi_hook.py:171} INFO - CALL dw_system.bulk_replicate(p_graph_name=>'replication_p2p_realtime',p_group_size=>4 , p_group=>1, p_dag_id=>'bulk_replication_p2p_realtime', p_task_id=>'t1')
[2020-07-20 05:23:21,450] {logging_mixin.py:95} INFO - [2020-07-20 05:23:21,449] {timeout.py:42} ERROR - Process timed out, PID: 26244
[2020-07-20 05:23:36,453] {logging_mixin.py:95} INFO - [2020-07-20 05:23:36,452] {jobs.py:2562} INFO - Task exited with return code 0
Does anyone know how to enforce the execution timeout for such long-running functions? It seems that the execution timeout is only evaluated once the PG function finishes.
Airflow uses the signal module from the standard library to effect a timeout. Airflow hooks into these system signals and requests that the calling process be notified in N seconds; should the process still be inside the context (see the __enter__ and __exit__ methods on the class), it raises an AirflowTaskTimeout exception.
Unfortunately for this situation, there are certain classes of system operations that cannot be interrupted. This is actually called out in the signal documentation:
A long-running calculation implemented purely in C (such as regular expression matching on a large body of text) may run uninterrupted for an arbitrary amount of time, regardless of any signals received. The Python signal handlers will be called when the calculation finishes.
To which we say "But I'm not doing a long-running calculation in C!" -- yeah, for Airflow this is almost always due to uninterruptible I/O operations.
The last sentence of that quote neatly explains why the handler is still triggered even after the task is (frustratingly!) allowed to finish, well beyond your requested timeout.
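To make the mechanism concrete, here is a minimal sketch of the same signal-based pattern (not Airflow's actual class, just the idea; SIGALRM is Unix-only):

import signal
from contextlib import contextmanager

class TaskTimeout(Exception):
    """Stand-in for Airflow's AirflowTaskTimeout in this sketch."""

@contextmanager
def timeout(seconds):
    # Ask the OS to deliver SIGALRM after `seconds`; if we are still inside
    # the with-block when it arrives, the handler raises.
    def handler(signum, frame):
        raise TaskTimeout(f"timed out after {seconds}s")
    previous = signal.signal(signal.SIGALRM, handler)
    signal.alarm(seconds)        # "__enter__": schedule the alarm
    try:
        yield
    finally:
        signal.alarm(0)          # "__exit__": cancel the alarm
        signal.signal(signal.SIGALRM, previous)

# with timeout(300):
#     run_query()   # hypothetical long call
#
# If run_query() is ordinary Python, the handler interrupts it promptly.
# If it is a single C-level call (a long regex match, or a DB driver blocked
# in C waiting for the server), the Python handler only runs once that call
# returns -- which is why the "Process timed out" line only shows up after
# the procedure has finished.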

Scheduled DAG is not running. How do I diagnose the problem?

I am using Airflow 1.8.1. I have a DAG that I believe I have scheduled to run every 5 minutes, but it isn't doing so. (Ignore the 2 successful DAG runs; those were manually triggered.)
I look at the scheduler log for that DAG and I see:
[2019-04-26 22:03:35,601] {jobs.py:343} DagFileProcessor839 INFO - Started process (PID=5653) to work on /usr/local/airflow/dags/retrieve_airflow_artifacts.py
[2019-04-26 22:03:35,606] {jobs.py:1525} DagFileProcessor839 INFO - Processing file /usr/local/airflow/dags/retrieve_airflow_artifacts.py for tasks to queue
[2019-04-26 22:03:35,607] {models.py:168} DagFileProcessor839 INFO - Filling up the DagBag from /usr/local/airflow/dags/retrieve_airflow_artifacts.py
[2019-04-26 22:03:36,083] {jobs.py:1539} DagFileProcessor839 INFO - DAG(s) ['retrieve_airflow_artifacts'] retrieved from /usr/local/airflow/dags/retrieve_airflow_artifacts.py
[2019-04-26 22:03:36,112] {jobs.py:1172} DagFileProcessor839 INFO - Processing retrieve_airflow_artifacts
[2019-04-26 22:03:36,126] {jobs.py:566} DagFileProcessor839 INFO - Skipping SLA check for <DAG: retrieve_airflow_artifacts> because no tasks in DAG have SLAs
[2019-04-26 22:03:36,132] {models.py:323} DagFileProcessor839 INFO - Finding 'running' jobs without a recent heartbeat
[2019-04-26 22:03:36,132] {models.py:329} DagFileProcessor839 INFO - Failing jobs without heartbeat after 2019-04-26 21:58:36.132768
[2019-04-26 22:03:36,139] {jobs.py:351} DagFileProcessor839 INFO - Processing /usr/local/airflow/dags/retrieve_airflow_artifacts.py took 0.539 seconds
[2019-04-26 22:04:06,776] {jobs.py:343} DagFileProcessor845 INFO - Started process (PID=5678) to work on /usr/local/airflow/dags/retrieve_airflow_artifacts.py
[2019-04-26 22:04:06,780] {jobs.py:1525} DagFileProcessor845 INFO - Processing file /usr/local/airflow/dags/retrieve_airflow_artifacts.py for tasks to queue
[2019-04-26 22:04:06,780] {models.py:168} DagFileProcessor845 INFO - Filling up the DagBag from /usr/local/airflow/dags/retrieve_airflow_artifacts.py
[2019-04-26 22:04:07,258] {jobs.py:1539} DagFileProcessor845 INFO - DAG(s) ['retrieve_airflow_artifacts'] retrieved from /usr/local/airflow/dags/retrieve_airflow_artifacts.py
[2019-04-26 22:04:07,287] {jobs.py:1172} DagFileProcessor845 INFO - Processing retrieve_airflow_artifacts
[2019-04-26 22:04:07,301] {jobs.py:566} DagFileProcessor845 INFO - Skipping SLA check for <DAG: retrieve_airflow_artifacts> because no tasks in DAG have SLAs
[2019-04-26 22:04:07,307] {models.py:323} DagFileProcessor845 INFO - Finding 'running' jobs without a recent heartbeat
[2019-04-26 22:04:07,307] {models.py:329} DagFileProcessor845 INFO - Failing jobs without heartbeat after 2019-04-26 21:59:07.307607
[2019-04-26 22:04:07,314] {jobs.py:351} DagFileProcessor845 INFO - Processing /usr/local/airflow/dags/retrieve_airflow_artifacts.py took 0.538 seconds
over and over again. I've compared this to a DAG on another server, so I know there should be extra log records indicating that the DAG had been triggered by the schedule; there are no such records in this log file.
Here's how the schedule of my DAG is defined:
args = {
    'owner': 'airflow',
    'start_date': (datetime.datetime.now() - datetime.timedelta(minutes=5))
}

dag = DAG(
    dag_id='retrieve_airflow_artifacts', default_args=args,
    schedule_interval="0,5,10,15,20,25,30,35,40,45,50,55 * * * *")
Could someone help me figure out why my DAG isn't running? I've looked high and low and cannot figure it out.
If I had to guess, I would say your start_date is causing you some issues.
Change your args to use a static start_date and prevent the DAG from running on past intervals:
args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2019, 4, 27)  # year, month, day
}
Also, just to make it easier to read, change your DAG args to (same functionality):
dag = DAG(
    dag_id='retrieve_airflow_artifacts',
    default_args=args,
    schedule_interval="*/5 * * * *"
)
That should allow the scheduler to pick it up!
It's generally recommended not to set your start_date dynamically.
Taken from Airflow FAQ:
We recommend against using dynamic values as start_date, especially datetime.now() as it can be quite confusing. The task is triggered once the period closes, and in theory an @hourly DAG would never get to an hour after now as now() moves along.
Another SO question on this: why dynamic start dates cause issues

Airflow 1.9.0 is queuing but tasks are not running

Airflow stopped running tasks all of a sudden. All of the following are running:
airflow scheduler
airflow webserver
airflow worker
WebUI message:
All dependencies are met but the task instance is not running. In most
cases this just means that the task will probably be scheduled soon
unless:
- The scheduler is down or under heavy load
If this task instance does not start soon please contact your Airflow
administrator for assistance.
The scheduler seems to be stuck in a loop, repeating the messages below. The WebUI shows the tasks in the queued state. I tried restarting the scheduler, but it didn't help.
[2018-11-17 22:03:45,809] {{jobs.py:1607}} DEBUG - Starting Loop...
[2018-11-17 22:03:45,809] {{jobs.py:1627}} INFO - Heartbeating the process manager
[2018-11-17 22:03:45,810] {{jobs.py:1662}} INFO - Heartbeating the executor
[2018-11-17 22:03:45,810] {{base_executor.py:103}} DEBUG - 124 running task instances
[2018-11-17 22:03:45,810] {{base_executor.py:104}} DEBUG - 0 in queue
[2018-11-17 22:03:45,810] {{base_executor.py:105}} DEBUG - 76 open slots
[2018-11-17 22:03:45,810] {{base_executor.py:132}} DEBUG - Calling the <class 'airflow.executors.celery_executor.CeleryExecutor'> sync method
[2018-11-17 22:03:45,810] {{celery_executor.py:80}} DEBUG - Inquiring about 124 celery task(s)
Airflow setup:
apache-airflow[celery, redis, all]==1.9.0
I also checked these posts, but they didn't help me:
Airflow 1.9.0 is queuing but not launching tasks
Airflow tasks get stuck at "queued" status and never gets running
Problem solved. This is a problem for builds created on or after 2018-11-15. It turns out apache-airflow[celery, redis, all]==1.9.0 pulls in the latest redis-py, 3.0.1, which does not work with celery 4.2.1.
The solution is to pin redis-py to 2.10.6:
redis==2.10.6
apache-airflow[celery, all]==1.9.0

Airflow execution_timeout settings not respected

In my tasks, I have execution_timeout=timedelta(minutes=1) set on the tasks and 'dagrun_timeout': timedelta(minutes=2) on my DAG, and this is correctly reflected in the web GUI's Task Instance Details. However, none of my task instances are actually marked as failed or retried when breaching the one-minute threshold. Rather, they time out at 11 minutes...
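The parameters are applied roughly as in this sketch (the operator, ids and dates are illustrative; the point is only where execution_timeout and dagrun_timeout are attached):

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python_operator import PythonOperator  # illustrative operator

dag = DAG(
    dag_id="example_dag",                     # illustrative
    start_date=datetime(2017, 11, 1),         # illustrative
    schedule_interval="@daily",               # illustrative
    dagrun_timeout=timedelta(minutes=2),      # whole-DAG-run timeout
)

def do_work():
    pass  # the long-running call goes here

task = PythonOperator(
    task_id="example_task",                   # illustrative
    python_callable=do_work,
    execution_timeout=timedelta(minutes=1),   # per-task-instance timeout
    dag=dag,
)

The log excerpt from one of these runs: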
[2017-11-02 18:00:05,376] {base_task_runner.py:95} INFO - Subtask: [2017-11-02 18:00:05,370] {base_hook.py:67} INFO - Using connection to: [REDACTED]
[2017-11-02 18:10:06,505] {base_task_runner.py:95} INFO - Subtask: [2017-11-02 18:10:06,504] {timeout.py:37} ERROR - Process timed out
Do I have a problem with my configuration, or is there something buggy happening with how Airflow interprets timeout settings?
