About ExternalTaskMarker & ExternalTaskSensor in Apache Airflow

Suppose there are two DAGs, DAG1 and DAG2. Do I need to schedule the ExternalTaskMarker in DAG1 and the ExternalTaskSensor in DAG2 at the same time? If not, how should I schedule them?
There are two DAGs, DAG1 and DAG2. I defined an ExternalTaskMarker in DAG1, which is scheduled at 15 * * * *, and an ExternalTaskSensor in DAG2, which is scheduled at 0 * * * *. I see that the ExternalTaskMarker runs fine in DAG1, but the ExternalTaskSensor in DAG2 keeps failing with the error below.
[2022-11-05 15:32:12,091] {{taskinstance.py:901}} INFO - Executing <Task(ExternalTaskSensor): external_task_sensor-plume_dev_download_dimensions_hourly-pan.device_master_dimension> on 2022-11-03T20:00:00+00:00
[2022-11-05 15:32:12,095] {{standard_task_runner.py:54}} INFO - Started process 48476 to run task
[2022-11-05 15:32:12,123] {{standard_task_runner.py:77}} INFO - Running: ['airflow', 'run', 'sd_plume_dev_incoming_data_quality_hourly-pan', 'external_task_sensor-plume_dev_download_dimensions_hourly-pan.device_master_dimension', '2022-11-03T20:00:00+00:00', '--job_id', '56965917', '--pool', 'default_pool', '--raw', '-sd', 'DAGS_FOLDER/sd_plume_dev_incoming_data_quality_hourly-pan.py', '--cfg_path', '/tmp/tmpy1p4tag5']
[2022-11-05 15:32:12,123] {{standard_task_runner.py:78}} INFO - Job 56965917: Subtask external_task_sensor-plume_dev_download_dimensions_hourly-pan.device_master_dimension
[2022-11-05 15:32:12,192] {{logging_mixin.py:112}} INFO - Running %s on host %s <TaskInstance: sd_plume_dev_incoming_data_quality_hourly-pan.external_task_sensor-plume_dev_download_dimensions_hourly-pan.device_master_dimension 2022-11-03T20:00:00+00:00 [running]> itclab21
[2022-11-05 15:32:12,278] {{external_task_sensor.py:117}} INFO - Poking for plume_dev_download_dimensions_hourly-pan.device_master_dimension on 2022-11-03T20:00:00+00:00 ...
[2022-11-05 15:32:12,301] {{taskinstance.py:1141}} INFO - Rescheduling task, marking task as UP_FOR_RESCHEDULE
[2022-11-05 15:32:17,075] {{local_task_job.py:102}} INFO - Task exited with return code 0
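For reference, a minimal sketch of the setup described above. It assumes Airflow 1.10-style imports (matching the module paths in the log) and uses illustrative DAG/task ids; the execution_delta value is one possible way to line up the mismatched schedules and is an assumption, not part of the original question.

from datetime import datetime, timedelta

from airflow import DAG
from airflow.sensors.external_task_sensor import ExternalTaskMarker, ExternalTaskSensor

# DAG1 runs at minute 15 of every hour and carries the marker.
with DAG("DAG1", start_date=datetime(2022, 11, 1), schedule_interval="15 * * * *") as dag1:
    marker = ExternalTaskMarker(
        task_id="external_task_marker",
        external_dag_id="DAG2",
        external_task_id="external_task_sensor",
        # with mismatched schedules, the marker's execution_date template may
        # also need adjusting so that clearing it clears the intended DAG2 run
    )

# DAG2 runs at minute 0 of every hour and waits for DAG1.
with DAG("DAG2", start_date=datetime(2022, 11, 1), schedule_interval="0 * * * *") as dag2:
    sensor = ExternalTaskSensor(
        task_id="external_task_sensor",
        external_dag_id="DAG1",
        external_task_id="external_task_marker",
        # ExternalTaskSensor compares execution dates exactly; because the two
        # crons are offset, the sensor pokes for a DAG1 run that never exists
        # unless the offset is translated, e.g. mapping DAG2's run to the
        # DAG1 run 45 minutes earlier:
        execution_delta=timedelta(minutes=45),
        mode="reschedule",
    )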

Related

Airflow: How to use the same tasks in different dags

I am learning Airflow and ran into a problem.
I have 2 tasks that I want to use in several DAGs. The only difference between these tasks will be the parameters passed to the operators.
This could be accomplished by simply copying and pasting the tasks into all the DAGs, but maintaining that kind of code would be a nightmare.
So what I want to do is create a class that contains the tasks I will be calling several times and just import this class from the DAGs.
I replicated the issue with a minimal example.
This is the code for the class:
from airflow.operators.bash_operator import BashOperator

class Operator_generator():
    _instance = None

    def __init__(self, var1, var2):
        self.var1 = var1
        self.var2 = var2

    def create_task_1(self):
        return BashOperator(
            task_id='task1',
            bash_command='echo Im running task 1, the current execution date is {{ds}} and the previous execution date is {{prev_ds}}'
        )

    def create_task_2(self):
        return BashOperator(
            task_id='task2',
            bash_command='echo Im running task 2, the current execution date is {{ds}} and the previous execution date is {{prev_ds}}'
        )
And this is a DAG example where I would import the class:
from include.src.date.decorator import DefaultDateTime
from airflow import DAG
from include.src.airflow.xcom import cleanup
from operator_creator import Operator_generator

dag_id = "dag1"

default_args = {
    "owner": "airflow",
    "depends_on_past": False,
    "start_date": DefaultDateTime(2021, 6, 1),
    'retries': 1
}

# Dag definition
with DAG(
    dag_id,
    default_args=default_args,
    schedule_interval='@monthly',
    catchup=False,
    on_failure_callback=cleanup,
    on_success_callback=cleanup
) as dag:
    dag.doc_md = __doc__
    operator_generator = Operator_generator('var1', 'var2')
    task1 = operator_generator.create_task_1()
    task2 = operator_generator.create_task_2()
    task1 >> task2
Note that 'var1' and 'var2' are values that I need in order to parametrize the operators.
The problem is that when I run the DAG, the tasks run twice:
[2021-08-25 16:29:46,937] {taskinstance.py:880} INFO - Starting attempt 1 of 2
[2021-08-25 16:29:46,937] {taskinstance.py:881} INFO -
--------------------------------------------------------------------------------
[2021-08-25 16:29:46,955] {taskinstance.py:900} INFO - Executing <Task(BashOperator): task1> on 2021-07-01T06:00:00+00:00
[2021-08-25 16:29:46,961] {standard_task_runner.py:53} INFO - Started process 67689 to run task
[2021-08-25 16:29:47,011] {logging_mixin.py:112} INFO - Running %s on host %s <TaskInstance: dag1.task1 2021-07-01T06:00:00+00:00 [running]> 30b770753547
[2021-08-25 16:29:47,032] {bash_operator.py:113} INFO - Tmp dir root location:
/tmp
[2021-08-25 16:29:47,033] {bash_operator.py:136} INFO - Temporary script location: /tmp/airflowtmpixijgd4s/task1lfcwdvfa
[2021-08-25 16:29:47,033] {bash_operator.py:146} INFO - Running command: echo Im running task 1, the current execution date is 2021-07-01 and the previous execution date is 2021-06-01
[2021-08-25 16:29:47,039] {bash_operator.py:153} INFO - Output:
[2021-08-25 16:29:47,040] {bash_operator.py:157} INFO - Im running task 1, the current execution date is 2021-07-01 and the previous execution date is 2021-06-01
[2021-08-25 16:29:47,040] {bash_operator.py:161} INFO - Command exited with return code 0
[2021-08-25 16:29:47,052] {taskinstance.py:1065} INFO - Marking task as SUCCESS.dag_id=dag1, task_id=task1, execution_date=20210701T060000, start_date=20210825T162946, end_date=20210825T162947
[2021-08-25 16:29:55,335] {taskinstance.py:669} INFO - Dependencies all met for <TaskInstance: dag1.task1 2021-08-25T16:29:41+00:00 [queued]>
[2021-08-25 16:29:55,335] {taskinstance.py:669} INFO - Dependencies all met for <TaskInstance: dag1.task2 2021-07-01T06:00:00+00:00 [queued]>
[2021-08-25 16:29:55,348] {taskinstance.py:669} INFO - Dependencies all met for <TaskInstance: dag1.task1 2021-08-25T16:29:41+00:00 [queued]>
[2021-08-25 16:29:55,348] {taskinstance.py:879} INFO -
--------------------------------------------------------------------------------
[2021-08-25 16:29:55,348] {taskinstance.py:880} INFO - Starting attempt 1 of 2
[2021-08-25 16:29:55,348] {taskinstance.py:881} INFO -
--------------------------------------------------------------------------------
[2021-08-25 16:29:55,357] {taskinstance.py:669} INFO - Dependencies all met for <TaskInstance: dag1.task2 2021-07-01T06:00:00+00:00 [queued]>
[2021-08-25 16:29:55,357] {taskinstance.py:879} INFO -
--------------------------------------------------------------------------------
[2021-08-25 16:29:55,357] {taskinstance.py:880} INFO - Starting attempt 1 of 2
[2021-08-25 16:29:55,357] {taskinstance.py:881} INFO -
--------------------------------------------------------------------------------
[2021-08-25 16:29:55,363] {taskinstance.py:900} INFO - Executing <Task(BashOperator): task1> on 2021-08-25T16:29:41+00:00
[2021-08-25 16:29:55,366] {standard_task_runner.py:53} INFO - Started process 67809 to run task
[2021-08-25 16:29:55,370] {taskinstance.py:900} INFO - Executing <Task(BashOperator): task2> on 2021-07-01T06:00:00+00:00
[2021-08-25 16:29:55,374] {standard_task_runner.py:53} INFO - Started process 67810 to run task
[2021-08-25 16:29:55,412] {logging_mixin.py:112} INFO - Running %s on host %s <TaskInstance: dag1.task1 2021-08-25T16:29:41+00:00 [running]> 30b770753547
[2021-08-25 16:29:55,422] {logging_mixin.py:112} INFO - Running %s on host %s <TaskInstance: dag1.task2 2021-07-01T06:00:00+00:00 [running]> 30b770753547
[2021-08-25 16:29:55,430] {bash_operator.py:113} INFO - Tmp dir root location:
/tmp
[2021-08-25 16:29:55,432] {bash_operator.py:136} INFO - Temporary script location: /tmp/airflowtmpsacovlfm/task1doc6fakb
[2021-08-25 16:29:55,432] {bash_operator.py:146} INFO - Running command: echo Im running task 1, the current execution date is 2021-08-25 and the previous execution date is 2021-08-25
[2021-08-25 16:29:55,440] {bash_operator.py:153} INFO - Output:
[2021-08-25 16:29:55,440] {bash_operator.py:157} INFO - Im running task 1, the current execution date is 2021-08-25 and the previous execution date is 2021-08-25
[2021-08-25 16:29:55,441] {bash_operator.py:161} INFO - Command exited with return code 0
[2021-08-25 16:29:55,444] {bash_operator.py:113} INFO - Tmp dir root location:
/tmp
[2021-08-25 16:29:55,445] {bash_operator.py:136} INFO - Temporary script location: /tmp/airflowtmpyqqww8an/task2i29a2lk7
[2021-08-25 16:29:55,445] {bash_operator.py:146} INFO - Running command: echo Im running task 2, the current execution date is 2021-07-01 and the previous execution date is 2021-06-01
[2021-08-25 16:29:55,451] {taskinstance.py:1065} INFO - Marking task as SUCCESS.dag_id=dag1, task_id=task1, execution_date=20210825T162941, start_date=20210825T162955, end_date=20210825T162955
[2021-08-25 16:29:55,453] {bash_operator.py:153} INFO - Output:
[2021-08-25 16:29:55,453] {bash_operator.py:157} INFO - Im running task 2, the current execution date is 2021-07-01 and the previous execution date is 2021-06-01
[2021-08-25 16:29:55,454] {bash_operator.py:161} INFO - Command exited with return code 0
[2021-08-25 16:29:55,465] {taskinstance.py:1065} INFO - Marking task as SUCCESS.dag_id=dag1, task_id=task2, execution_date=20210701T060000, start_date=20210825T162955, end_date=20210825T162955
[2021-08-25 16:29:56,922] {logging_mixin.py:112} INFO - [2021-08-25 16:29:56,921] {local_task_job.py:103} INFO - Task exited with return code 0
[2021-08-25 16:30:05,333] {logging_mixin.py:112} INFO - [2021-08-25 16:30:05,333] {local_task_job.py:103} INFO - Task exited with return code 0
[2021-08-25 16:30:05,337] {logging_mixin.py:112} INFO - [2021-08-25 16:30:05,337] {local_task_job.py:103} INFO - Task exited with return code 0
[2021-08-25 16:30:06,794] {taskinstance.py:669} INFO - Dependencies all met for <TaskInstance: dag1.task2 2021-08-25T16:29:41+00:00 [queued]>
[2021-08-25 16:30:06,809] {taskinstance.py:669} INFO - Dependencies all met for <TaskInstance: dag1.task2 2021-08-25T16:29:41+00:00 [queued]>
[2021-08-25 16:30:06,809] {taskinstance.py:879} INFO -
--------------------------------------------------------------------------------
[2021-08-25 16:30:06,810] {taskinstance.py:880} INFO - Starting attempt 1 of 2
[2021-08-25 16:30:06,810] {taskinstance.py:881} INFO -
--------------------------------------------------------------------------------
[2021-08-25 16:30:06,822] {taskinstance.py:900} INFO - Executing <Task(BashOperator): task2> on 2021-08-25T16:29:41+00:00
[2021-08-25 16:30:06,826] {standard_task_runner.py:53} INFO - Started process 67937 to run task
[2021-08-25 16:30:06,875] {logging_mixin.py:112} INFO - Running %s on host %s <TaskInstance: dag1.task2 2021-08-25T16:29:41+00:00 [running]> 30b770753547
[2021-08-25 16:30:06,892] {bash_operator.py:113} INFO - Tmp dir root location:
/tmp
[2021-08-25 16:30:06,893] {bash_operator.py:136} INFO - Temporary script location: /tmp/airflowtmpot_xsukw/task2xo4uxspu
[2021-08-25 16:30:06,893] {bash_operator.py:146} INFO - Running command: echo Im running task 2, the current execution date is 2021-08-25 and the previous execution date is 2021-08-25
[2021-08-25 16:30:06,901] {bash_operator.py:153} INFO - Output:
[2021-08-25 16:30:06,902] {bash_operator.py:157} INFO - Im running task 2, the current execution date is 2021-08-25 and the previous execution date is 2021-08-25
[2021-08-25 16:30:06,902] {bash_operator.py:161} INFO - Command exited with return code 0
[2021-08-25 16:30:06,913] {taskinstance.py:1065} INFO - Marking task as SUCCESS.dag_id=dag1, task_id=task2, execution_date=20210825T162941, start_date=20210825T163006, end_date=20210825T163006
[2021-08-25 16:30:16,800] {logging_mixin.py:112} INFO - [2021-08-25 16:30:16,799] {local_task_job.py:103} INFO - Task exited with return code 0
Notice how the tasks are executed 2 times:
In the first execution the values of {{ds}} and {{prev_ds}} are the current date.
In the second execution the values of {{ds}} and {{prev_ds}} correspond to the monthly interval.
Why do the tasks run twice?
Is there a way to import tasks like this?
Note 1: I am not allowed to use subdags.
Edit: Adding the execution tree
Edit 2:
In case anyone else runs into this problem, I figured it out.
The problem was that I was running the DAG with an external trigger that deletes the DAG and starts it again. So the DAG runs for the external trigger, but the scheduler also sees that the DAG hasn't run for the month, so it schedules its own execution, resulting in 2 runs.
The solution I found is:
Turn off the DAG in the Airflow interface
Delete the DAG (with the red x at the far right of the DAG row)
Refresh the page
When the DAG appears again in the list, turn it on
This will make the scheduler do its job and the DAG will run as it should.

Why are my Airflow tasks being "externally set to failed"?

I'm using Airflow 2.0.0, and my tasks are sporadically being killed "externally" after running for a few seconds or minutes. The tasks usually run successfully (both for manual runs initiated via airflow tasks test ... and for scheduled DAG runs), so I believe this is not related to my DAG code.
When tasks fail, this seems to be the key error from the task logs:
{local_task_job.py:170} WARNING - State of this instance has been externally set to failed. Terminating instance.
[2020-12-20 11:26:11,448] {taskinstance.py:826} INFO - Dependencies all met for <TaskInstance: daily_backups.run_backupper 2020-12-19T02:00:00+00:00 [queued]>
[2020-12-20 11:26:11,473] {taskinstance.py:826} INFO - Dependencies all met for <TaskInstance: daily_backups.run_backupper 2020-12-19T02:00:00+00:00 [queued]>
[2020-12-20 11:26:11,473] {taskinstance.py:1017} INFO -
--------------------------------------------------------------------------------
[2020-12-20 11:26:11,473] {taskinstance.py:1018} INFO - Starting attempt 3 of 3
[2020-12-20 11:26:11,473] {taskinstance.py:1019} INFO -
--------------------------------------------------------------------------------
[2020-12-20 11:26:11,506] {taskinstance.py:1038} INFO - Executing <Task(PythonOperator): run_backupper> on 2020-12-19T02:00:00+00:00
[2020-12-20 11:26:11,509] {standard_task_runner.py:51} INFO - Started process 12059 to run task
[2020-12-20 11:26:11,515] {standard_task_runner.py:75} INFO - Running: ['airflow', 'tasks', 'run', 'daily_backups', 'run_backupper', '2020-12-19T02:00:00+00:00', '--job-id', '22', '--pool', 'default_pool', '--raw', '--subdir', 'DAGS_FOLDER/backupper/daily_backups.py', '--cfg-path', '/tmp/tmpnfmqtorg']
[2020-12-20 11:26:11,517] {standard_task_runner.py:76} INFO - Job 22: Subtask run_backupper
[2020-12-20 11:26:11,609] {logging_mixin.py:103} INFO - Running <TaskInstance: daily_backups.run_backupper 2020-12-19T02:00:00+00:00 [running]> on host localhost
[2020-12-20 11:26:11,742] {taskinstance.py:1232} INFO - Exporting the following env vars:
AIRFLOW_CTX_DAG_OWNER=<user>
AIRFLOW_CTX_DAG_ID=daily_backups
AIRFLOW_CTX_TASK_ID=run_backupper
AIRFLOW_CTX_EXECUTION_DATE=2020-12-19T02:00:00+00:00
AIRFLOW_CTX_DAG_RUN_ID=scheduled__2020-12-19T02:00:00+00:00
...
... my job's logs, indicating that the job is running healthily ...
...
[2020-12-20 11:26:16,587] {local_task_job.py:170} WARNING - State of this instance has been externally set to failed. Terminating instance.
[2020-12-20 11:26:16,593] {process_utils.py:95} INFO - Sending Signals.SIGTERM to GPID 12059
[2020-12-20 11:27:16,609] {process_utils.py:108} WARNING - process psutil.Process(pid=12059, name='airflow task runner: daily_backups run_backupper 2020-12-19T02:00:00+00:00 22', status='sleeping', started='11:26:11') did not respond to SIGTERM. Trying SIGKILL
[2020-12-20 11:27:16,618] {process_utils.py:61} INFO - Process psutil.Process(pid=12059, name='airflow task runner: daily_backups run_backupper 2020-12-19T02:00:00+00:00 22', status='terminated', exitcode=<Negsignal.SIGKILL: -9>, started='11:26:11') (12059) terminated with exit code Negsignal.SIGKILL
[2020-12-20 11:27:16,618] {local_task_job.py:118} INFO - Task exited with return code Negsignal.SIGKILL
The final few lines in the logs are not consistent. Here is a different version, for the same task that failed in an earlier attempt:
... same stuff as before ...
[2020-12-20 02:01:12,689] {local_task_job.py:170} WARNING - State of this instance has been externally set to failed. Terminating instance.
[2020-12-20 02:01:12,695] {process_utils.py:95} INFO - Sending Signals.SIGTERM to GPID 24442
[2020-12-20 02:02:00,462] {taskinstance.py:1214} ERROR - Received SIGTERM. Terminating subprocesses.
[2020-12-20 02:02:00,498] {process_utils.py:61} INFO - Process psutil.Process(pid=24442, status='terminated', exitcode=0, started='02:00:10') (24442) terminated with exit code 0
[2020-12-20 02:02:00,499] {local_task_job.py:118} INFO - Task exited with return code 0
I suspect in this case the script was able to respond to the SIGTERM in time, whereas in the previous case it was blocked on a long-running query and was not able to terminate cleanly.
I believe the problem was that the scheduler health check threshold was set to be smaller than the scheduler heartbeat interval.
In my config I had set scheduler_health_check_threshold to 30 seconds and scheduler_heartbeat_sec to 60 seconds. During the check for orphaned tasks (itself governed by a different parameter, orphaned_tasks_check_interval), the scheduler heartbeat was determined to be older than 30 seconds, which makes sense, because it was only heartbeating every 60 seconds. Thus the scheduler was inferred to be unhealthy and was therefore terminated.
Around the time of the failure, I could see messages like these in /var/log/syslog
Dec 20 11:26:14 localhost bash[11545]: [2020-12-20 11:26:14,368] {scheduler_job.py:1751} INFO - Resetting orphaned tasks for active dag runs
Dec 20 11:26:14 localhost bash[11545]: [2020-12-20 11:26:14,373] {scheduler_job.py:1764} INFO - Marked 1 SchedulerJob instances as failed
Dec 20 11:26:14 localhost bash[11545]: [2020-12-20 11:26:14,381] {scheduler_job.py:1805} INFO - Reset the following 1 orphaned TaskInstances:
Dec 20 11:26:14 localhost bash[11545]: #011<TaskInstance: daily_backups.run_backupper 2020-12-19 02:00:00+00:00 [running]>
Dec 20 11:26:14 localhost bash[11545]: [2020-12-20 11:26:14,571] {scheduler_job.py:938} INFO - 1 tasks up for execution:
Dec 20 11:26:14 localhost bash[11545]: #011<TaskInstance: daily_backups.run_backupper 2020-12-19 02:00:00+00:00 [scheduled]>
Dec 20 11:26:14 localhost bash[11545]: [2020-12-20 11:26:14,574] {scheduler_job.py:972} INFO - Figuring out tasks to run in Pool(name=default_pool) with 128 open slots and 1 task instances ready to be queued
Dec 20 11:26:14 localhost bash[11545]: [2020-12-20 11:26:14,575] {scheduler_job.py:999} INFO - DAG daily_backups has 0/16 running and queued tasks
Dec 20 11:26:14 localhost bash[11545]: [2020-12-20 11:26:14,575] {scheduler_job.py:1060} INFO - Setting the following tasks to queued state:
Dec 20 11:26:14 localhost bash[11545]: #011<TaskInstance: daily_backups.run_backupper 2020-12-19 02:00:00+00:00 [scheduled]>
Dec 20 11:26:14 localhost bash[11545]: [2020-12-20 11:26:14,578] {scheduler_job.py:1102} INFO - Sending TaskInstanceKey(dag_id='daily_backups', task_id='run_backupper', execution_date=datetime.datetime(2020, 12, 19, 2, 0, tzinfo=Timezone('UTC')), try_number=4) to executor with priority 2 and queue default
Dec 20 11:26:14 localhost bash[11545]: [2020-12-20 11:26:14,578] {base_executor.py:79} INFO - Adding to queue: ['airflow', 'tasks', 'run', 'daily_backups', 'run_backupper', '2020-12-19T02:00:00+00:00', '--local', '--pool', 'default_pool', '--subdir', '/storage/airflow/dags/backupper/daily_backups.py']
Dec 20 11:26:14 localhost bash[11545]: [2020-12-20 11:26:14,581] {local_executor.py:81} INFO - QueuedLocalWorker running ['airflow', 'tasks', 'run', 'daily_backups', 'run_backupper', '2020-12-19T02:00:00+00:00', '--local', '--pool', 'default_pool', '--subdir', '/storage/airflow/dags/backupper/daily_backups.py']
Dec 20 11:26:14 localhost bash[11545]: [2020-12-20 11:26:14,707] {dagbag.py:440} INFO - Filling up the DagBag from /storage/airflow/dags/backupper/daily_backups.py
Dec 20 11:26:15 localhost bash[11545]: Running <TaskInstance: daily_backups.run_backupper 2020-12-19T02:00:00+00:00 [queued]> on host localhost
and the timestamps coincide closely with the SIGTERM received by my task. I guess that since the SchedulerJob was marked as failed, then the TaskInstance running my actual task was considered an orphan, and thus marked for termination. At the same time it scheduled a new attempt (try_number=4).
Increasing the scheduler_health_check_threshold to 120 seconds and restarting the scheduler/webserver services appears to have resolved my issue.
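For reference, these knobs live in the [scheduler] section of airflow.cfg. A minimal sketch of the change described above, using the values mentioned in this answer:

[scheduler]
# how often the scheduler heartbeats, in seconds (60 in the setup above)
scheduler_heartbeat_sec = 60
# how stale the heartbeat may be before the scheduler is considered unhealthy;
# keep this comfortably larger than scheduler_heartbeat_sec (raised from 30 to 120)
scheduler_health_check_threshold = 120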
I had the same issue.
From logs:
2021-05-07 13:04:19,960 INFO - Resetting orphaned tasks for active dag runs
2021-05-07 13:09:20,060 INFO - Resetting orphaned tasks for active dag runs
2021-05-07 13:14:20,186 INFO - Resetting orphaned tasks for active dag runs
2021-05-07 13:19:20,263 INFO - Resetting orphaned tasks for active dag runs
2021-05-07 13:24:20,399 INFO - Resetting orphaned tasks for active dag runs
2021-05-07 13:29:20,729 INFO - Resetting orphaned tasks for active dag runs
2021-05-07 13:34:20,892 INFO - Resetting orphaned tasks for active dag runs
2021-05-07 13:39:21,070 INFO - Resetting orphaned tasks for active dag runs
2021-05-07 13:44:21,328 INFO - Resetting orphaned tasks for active dag runs
2021-05-07 13:49:21,423 INFO - Resetting orphaned tasks for active dag runs
And my scheduler config was as follows:
# Task instances listen for external kill signal (when you clear tasks
# from the CLI or the UI), this defines the frequency at which they should
# listen (in seconds).
job_heartbeat_sec = 5
I had the same issue in AKS (Azure Kubernetes).
I resolved it by setting AIRFLOW__SCHEDULER__SCHEDULE_AFTER_TASK_EXECUTION to False.
https://github.com/apache/airflow/issues/14672
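A minimal sketch of that setting, assuming the scheduler reads its configuration from environment variables (as is typical for a Kubernetes deployment); the airflow.cfg equivalent is shown as well:

# environment of the scheduler container
AIRFLOW__SCHEDULER__SCHEDULE_AFTER_TASK_EXECUTION=False

# or in airflow.cfg
[scheduler]
schedule_after_task_execution = False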
I had the same problem, running on a MacBook. It seems to be the MacBook going to sleep; I solved it by ticking "Prevent your Mac from automatically sleeping when the display is off" in the "Power Adapter" section of Preferences -> Battery (when on a charger).

Airflow Execution Timeout not working well

I've set the 'execution_timeout': timedelta(seconds=300) parameter on many tasks. When the execution timeout is set on a task downloading data from Google Analytics, it works properly: after ~300 seconds the task is set to failed. That task downloads some data from an API (Python), then does some transformations (Python) and loads the data into PostgreSQL.
Then I have a task that executes only one PostgreSQL function. Its execution sometimes takes more than 300 seconds, but I get this (the task is marked as finished successfully):
*** Reading local file: /home/airflow/airflow/logs/bulk_replication_p2p_realtime/t1/2020-07-20T00:05:00+00:00/1.log
[2020-07-20 05:05:35,040] {__init__.py:1139} INFO - Dependencies all met for <TaskInstance: bulk_replication_p2p_realtime.t1 2020-07-20T00:05:00+00:00 [queued]>
[2020-07-20 05:05:35,051] {__init__.py:1139} INFO - Dependencies all met for <TaskInstance: bulk_replication_p2p_realtime.t1 2020-07-20T00:05:00+00:00 [queued]>
[2020-07-20 05:05:35,051] {__init__.py:1353} INFO -
--------------------------------------------------------------------------------
[2020-07-20 05:05:35,051] {__init__.py:1354} INFO - Starting attempt 1 of 1
[2020-07-20 05:05:35,051] {__init__.py:1355} INFO -
--------------------------------------------------------------------------------
[2020-07-20 05:05:35,098] {__init__.py:1374} INFO - Executing <Task(PostgresOperator): t1> on 2020-07-20T00:05:00+00:00
[2020-07-20 05:05:35,099] {base_task_runner.py:119} INFO - Running: ['airflow', 'run', 'bulk_replication_p2p_realtime', 't1', '2020-07-20T00:05:00+00:00', '--job_id', '958216', '--raw', '-sd', 'DAGS_FOLDER/bulk_replication_p2p_realtime.py', '--cfg_path', '/tmp/tmph11tn6fe']
[2020-07-20 05:05:37,348] {base_task_runner.py:101} INFO - Job 958216: Subtask t1 [2020-07-20 05:05:37,347] {settings.py:182} INFO - settings.configure_orm(): Using pool settings. pool_size=10, pool_recycle=1800, pid=26244
[2020-07-20 05:05:39,503] {base_task_runner.py:101} INFO - Job 958216: Subtask t1 [2020-07-20 05:05:39,501] {__init__.py:51} INFO - Using executor LocalExecutor
[2020-07-20 05:05:39,857] {base_task_runner.py:101} INFO - Job 958216: Subtask t1 [2020-07-20 05:05:39,856] {__init__.py:305} INFO - Filling up the DagBag from /home/airflow/airflow/dags/bulk_replication_p2p_realtime.py
[2020-07-20 05:05:39,894] {base_task_runner.py:101} INFO - Job 958216: Subtask t1 [2020-07-20 05:05:39,894] {cli.py:517} INFO - Running <TaskInstance: bulk_replication_p2p_realtime.t1 2020-07-20T00:05:00+00:00 [running]> on host dwh2-airflow-dev
[2020-07-20 05:05:39,938] {postgres_operator.py:62} INFO - Executing: CALL dw_system.bulk_replicate(p_graph_name=>'replication_p2p_realtime',p_group_size=>4 , p_group=>1, p_dag_id=>'bulk_replication_p2p_realtime', p_task_id=>'t1')
[2020-07-20 05:05:39,960] {logging_mixin.py:95} INFO - [2020-07-20 05:05:39,953] {base_hook.py:83} INFO - Using connection to: id: postgres_warehouse. Host: XXX Port: 5432, Schema: XXXX Login: XXX Password: XXXXXXXX, extra: {}
[2020-07-20 05:05:39,973] {logging_mixin.py:95} INFO - [2020-07-20 05:05:39,972] {dbapi_hook.py:171} INFO - CALL dw_system.bulk_replicate(p_graph_name=>'replication_p2p_realtime',p_group_size=>4 , p_group=>1, p_dag_id=>'bulk_replication_p2p_realtime', p_task_id=>'t1')
[2020-07-20 05:23:21,450] {logging_mixin.py:95} INFO - [2020-07-20 05:23:21,449] {timeout.py:42} ERROR - Process timed out, PID: 26244
[2020-07-20 05:23:36,453] {logging_mixin.py:95} INFO - [2020-07-20 05:23:36,452] {jobs.py:2562} INFO - Task exited with return code 0
Does anyone know how to enforce the execution timeout for such long-running functions? It seems that the execution timeout is only evaluated once the PG function finishes.
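For reference, a minimal sketch of the kind of task the question describes. The DAG id, connection id and SQL are taken from the logs above; the schedule and imports are assumptions for illustration (Airflow 1.10-style paths, matching the log output).

from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.postgres_operator import PostgresOperator

with DAG(
    "bulk_replication_p2p_realtime",
    start_date=datetime(2020, 7, 1),
    schedule_interval="5 * * * *",  # assumed; the logs only show the execution date
) as dag:
    t1 = PostgresOperator(
        task_id="t1",
        postgres_conn_id="postgres_warehouse",
        sql="CALL dw_system.bulk_replicate(p_graph_name=>'replication_p2p_realtime', p_group_size=>4, p_group=>1, p_dag_id=>'bulk_replication_p2p_realtime', p_task_id=>'t1')",
        # the timeout the question sets; it is enforced with a signal, which a
        # blocking C-level database call may not honour until it returns
        execution_timeout=timedelta(seconds=300),
    )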
Airflow uses the signal module from the standard library to effect a timeout. Airflow hooks into these system signals to request that the calling process be notified after N seconds; if the process is still inside the timeout context at that point (see the __enter__ and __exit__ methods on the timeout class), an AirflowTaskTimeout exception is raised.
Unfortunately for this situation, there are certain classes of system operations that cannot be interrupted. This is actually called out in the signal documentation:
A long-running calculation implemented purely in C (such as regular expression matching on a large body of text) may run uninterrupted for an arbitrary amount of time, regardless of any signals received. The Python signal handlers will be called when the calculation finishes.
To which you might say, "But I'm not doing a long-running calculation in C!" For Airflow this is almost always due to uninterruptible I/O operations.
The sentence quoted above explains why the handler still fires, but only after the task is (frustratingly!) allowed to finish, well beyond the requested timeout.
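A minimal sketch of the pattern being described, using only the standard library. This is an illustration of the mechanism, not Airflow's actual implementation:

import signal
import time

class Timeout:
    """Context manager that raises an exception once `seconds` have elapsed."""

    def __init__(self, seconds):
        self.seconds = seconds

    def _handler(self, signum, frame):
        raise TimeoutError(f"timed out after {self.seconds}s")

    def __enter__(self):
        # ask the OS to deliver SIGALRM to this process in `seconds` seconds
        signal.signal(signal.SIGALRM, self._handler)
        signal.setitimer(signal.ITIMER_REAL, self.seconds)
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # cancel the pending alarm if the block finished in time
        signal.setitimer(signal.ITIMER_REAL, 0)

# Pure-Python work is interrupted promptly...
try:
    with Timeout(1):
        while True:
            time.sleep(0.1)
except TimeoutError as exc:
    print(exc)

# ...but a single blocking call implemented in C (for example, a database driver
# waiting on a socket without returning control to Python) only sees the handler
# once it returns, which is exactly the behaviour described in the question.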

Airflow dag_id did not exist or it failed to parse

Currently I'm learning how to use Apache Airflow and trying to create a simple DAG script like this:
from datetime import datetime
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import PythonOperator

def print_hello():
    return 'Hello world!'

dag = DAG('hello_world', description='Simple tutorial DAG',
          schedule_interval='0 0 * * *',
          start_date=datetime(2020, 5, 23), catchup=False)

dummy_operator = DummyOperator(task_id='dummy_task', retries=3, dag=dag)
hello_operator = PythonOperator(task_id='hello_task', python_callable=print_hello, dag=dag)
dummy_operator >> hello_operator
I ran this DAG using the web server and it ran successfully; I even checked the logs:
[2020-05-23 20:43:53,411] {taskinstance.py:669} INFO - Dependencies all met for <TaskInstance: hello_world.hello_task 2020-05-23T13:42:17.463955+00:00 [queued]>
[2020-05-23 20:43:53,431] {taskinstance.py:669} INFO - Dependencies all met for <TaskInstance: hello_world.hello_task 2020-05-23T13:42:17.463955+00:00 [queued]>
[2020-05-23 20:43:53,432] {taskinstance.py:879} INFO -
--------------------------------------------------------------------------------
[2020-05-23 20:43:53,432] {taskinstance.py:880} INFO - Starting attempt 1 of 1
[2020-05-23 20:43:53,432] {taskinstance.py:881} INFO -
--------------------------------------------------------------------------------
[2020-05-23 20:43:53,448] {taskinstance.py:900} INFO - Executing <Task(PythonOperator): hello_task> on 2020-05-23T13:42:17.463955+00:00
[2020-05-23 20:43:53,477] {standard_task_runner.py:53} INFO - Started process 7442 to run task
[2020-05-23 20:43:53,685] {logging_mixin.py:112} INFO - Running %s on host %s <TaskInstance: hello_world.hello_task 2020-05-23T13:42:17.463955+00:00 [running]> LAPTOP-9BCTKM5O.localdomain
[2020-05-23 20:43:53,715] {python_operator.py:114} INFO - Done. Returned value was: Hello world!
[2020-05-23 20:43:53,738] {taskinstance.py:1052} INFO - Marking task as SUCCESS.dag_id=hello_world, task_id=hello_task, execution_date=20200523T134217, start_date=20200523T134353, end_date=20200523T134353
[2020-05-23 20:44:03,372] {logging_mixin.py:112} INFO - [2020-05-23 20:44:03,372] {local_task_job.py:103} INFO - Task exited with return code 0
But when I tried to test run a single task using this command:
airflow test dags/main.py hello_task 2020-05-23
it shows this error:
airflow.exceptions.AirflowException: dag_id could not be found: dags/main.py. Either the dag did not exist or it failed to parse.
Where did I go wrong?
You got your airflow test command a tad wrong: instead of giving the path to the DAG file, dags/main.py, you need to pass the dag_id itself, which is hello_world in your code.
So try this:
airflow test hello_world hello_task 2020-05-23
You should get output similar to this :)
airflow#940836ce7da4:/opt/airflow$ airflow test hello_world hello_task 2020-05-23
[2020-05-23 14:18:51,144] {__init__.py:51} INFO - Using executor CeleryExecutor
[2020-05-23 14:18:51,145] {dagbag.py:396} INFO - Filling up the DagBag from /opt/airflow/dags
[2020-05-23 14:18:51,190] {taskinstance.py:669} INFO - Dependencies all met for <TaskInstance: hello_world.hello_task 2020-05-23T00:00:00+00:00 [None]>
[2020-05-23 14:18:51,203] {taskinstance.py:669} INFO - Dependencies all met for <TaskInstance: hello_world.hello_task 2020-05-23T00:00:00+00:00 [None]>
[2020-05-23 14:18:51,203] {taskinstance.py:879} INFO -
--------------------------------------------------------------------------------
[2020-05-23 14:18:51,203] {taskinstance.py:880} INFO - Starting attempt 1 of 1
[2020-05-23 14:18:51,203] {taskinstance.py:881} INFO -
--------------------------------------------------------------------------------
[2020-05-23 14:18:51,204] {taskinstance.py:900} INFO - Executing <Task(PythonOperator): hello_task> on 2020-05-23T00:00:00+00:00
[2020-05-23 14:18:51,234] {python_operator.py:114} INFO - Done. Returned value was: Hello world!
[2020-05-23 14:18:51,249] {taskinstance.py:1065} INFO - Marking task as SUCCESS.dag_id=hello_world, task_id=hello_task, execution_date=20200523T000000, start_date=20200523T141851, end_date=20200523T141851
After Airflow 2.0, the syntax is:
airflow tasks test <dag_id> <task_id> <date>
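For example, with the DAG above (assuming an Airflow 2.x installation):

airflow tasks test hello_world hello_task 2020-05-23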

Scheduled DAG is not running. How do I diagnose the problem?

I am using Airflow 1.8.1. I have a DAG that I believe I have scheduled to run every 5 minutes, but it isn't doing so:
Ignore the 2 successful DAG runs, those were manually triggered.
I look at the scheduler log for that DAG and I see:
[2019-04-26 22:03:35,601] {jobs.py:343} DagFileProcessor839 INFO - Started process (PID=5653) to work on /usr/local/airflow/dags/retrieve_airflow_artifacts.py
[2019-04-26 22:03:35,606] {jobs.py:1525} DagFileProcessor839 INFO - Processing file /usr/local/airflow/dags/retrieve_airflow_artifacts.py for tasks to queue
[2019-04-26 22:03:35,607] {models.py:168} DagFileProcessor839 INFO - Filling up the DagBag from /usr/local/airflow/dags/retrieve_airflow_artifacts.py
[2019-04-26 22:03:36,083] {jobs.py:1539} DagFileProcessor839 INFO - DAG(s) ['retrieve_airflow_artifacts'] retrieved from /usr/local/airflow/dags/retrieve_airflow_artifacts.py
[2019-04-26 22:03:36,112] {jobs.py:1172} DagFileProcessor839 INFO - Processing retrieve_airflow_artifacts
[2019-04-26 22:03:36,126] {jobs.py:566} DagFileProcessor839 INFO - Skipping SLA check for <DAG: retrieve_airflow_artifacts> because no tasks in DAG have SLAs
[2019-04-26 22:03:36,132] {models.py:323} DagFileProcessor839 INFO - Finding 'running' jobs without a recent heartbeat
[2019-04-26 22:03:36,132] {models.py:329} DagFileProcessor839 INFO - Failing jobs without heartbeat after 2019-04-26 21:58:36.132768
[2019-04-26 22:03:36,139] {jobs.py:351} DagFileProcessor839 INFO - Processing /usr/local/airflow/dags/retrieve_airflow_artifacts.py took 0.539 seconds
[2019-04-26 22:04:06,776] {jobs.py:343} DagFileProcessor845 INFO - Started process (PID=5678) to work on /usr/local/airflow/dags/retrieve_airflow_artifacts.py
[2019-04-26 22:04:06,780] {jobs.py:1525} DagFileProcessor845 INFO - Processing file /usr/local/airflow/dags/retrieve_airflow_artifacts.py for tasks to queue
[2019-04-26 22:04:06,780] {models.py:168} DagFileProcessor845 INFO - Filling up the DagBag from /usr/local/airflow/dags/retrieve_airflow_artifacts.py
[2019-04-26 22:04:07,258] {jobs.py:1539} DagFileProcessor845 INFO - DAG(s) ['retrieve_airflow_artifacts'] retrieved from /usr/local/airflow/dags/retrieve_airflow_artifacts.py
[2019-04-26 22:04:07,287] {jobs.py:1172} DagFileProcessor845 INFO - Processing retrieve_airflow_artifacts
[2019-04-26 22:04:07,301] {jobs.py:566} DagFileProcessor845 INFO - Skipping SLA check for <DAG: retrieve_airflow_artifacts> because no tasks in DAG have SLAs
[2019-04-26 22:04:07,307] {models.py:323} DagFileProcessor845 INFO - Finding 'running' jobs without a recent heartbeat
[2019-04-26 22:04:07,307] {models.py:329} DagFileProcessor845 INFO - Failing jobs without heartbeat after 2019-04-26 21:59:07.307607
[2019-04-26 22:04:07,314] {jobs.py:351} DagFileProcessor845 INFO - Processing /usr/local/airflow/dags/retrieve_airflow_artifacts.py took 0.538 seconds
over and over again. I've compared this to a DAG on another server, so I know there should be extra log records indicating that the DAG had been triggered by the schedule; there are no such records in this log file.
Here's how the schedule of my DAG is defined:
args = {
    'owner': 'airflow',
    'start_date': (datetime.datetime.now() - datetime.timedelta(minutes=5))
}

dag = DAG(
    dag_id='retrieve_airflow_artifacts', default_args=args,
    schedule_interval="0,5,10,15,20,25,30,35,40,45,50,55 * * * *")
Could someone help me figure out why my DAG isn't running? I've looked high and low and cannot figure it out.
If I had to guess, I would say your start_date is causing you some issues.
Change your args to have a static start_date and prevent it from running on past intervals:
args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2019, 4, 27)  # year, month, day
}
Also, just to make it easier to read, change your DAG args to (same functionality):
dag = DAG(
    dag_id='retrieve_airflow_artifacts',
    default_args=args,
    schedule_interval="*/5 * * * *"
)
That should allow the scheduler to pick it up!
It's generally recommended not to set your start_date dynamically.
Taken from Airflow FAQ:
We recommend against using dynamic values as start_date, especially datetime.now() as it can be quite confusing. The task is triggered once the period closes, and in theory an @hourly DAG would never get to an hour after now as now() moves along.
Another SO question on this: why dynamic start dates cause issues
