Airflow dag_id did not exist or it failed to parse - airflow

currently im learning how to use Apache Airflow and trying to create a simple DAG script like this
from datetime import datetime
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import PythonOperator
def print_hello():
return 'Hello world!'
dag = DAG('hello_world', description='Simple tutorial DAG',
schedule_interval='0 0 * * *',
start_date=datetime(2020, 5, 23), catchup=False)
dummy_operator = DummyOperator(task_id='dummy_task', retries=3, dag=dag)
hello_operator = PythonOperator(task_id='hello_task', python_callable=print_hello, dag=dag)
dummy_operator >> hello_operator
i run those DAG using web server and run succesfully even checked the logs
[2020-05-23 20:43:53,411] {taskinstance.py:669} INFO - Dependencies all met for <TaskInstance: hello_world.hello_task 2020-05-23T13:42:17.463955+00:00 [queued]>
[2020-05-23 20:43:53,431] {taskinstance.py:669} INFO - Dependencies all met for <TaskInstance: hello_world.hello_task 2020-05-23T13:42:17.463955+00:00 [queued]>
[2020-05-23 20:43:53,432] {taskinstance.py:879} INFO -
--------------------------------------------------------------------------------
[2020-05-23 20:43:53,432] {taskinstance.py:880} INFO - Starting attempt 1 of 1
[2020-05-23 20:43:53,432] {taskinstance.py:881} INFO -
--------------------------------------------------------------------------------
[2020-05-23 20:43:53,448] {taskinstance.py:900} INFO - Executing <Task(PythonOperator): hello_task> on 2020-05-23T13:42:17.463955+00:00
[2020-05-23 20:43:53,477] {standard_task_runner.py:53} INFO - Started process 7442 to run task
[2020-05-23 20:43:53,685] {logging_mixin.py:112} INFO - Running %s on host %s <TaskInstance: hello_world.hello_task 2020-05-23T13:42:17.463955+00:00 [running]> LAPTOP-9BCTKM5O.localdomain
[2020-05-23 20:43:53,715] {python_operator.py:114} INFO - Done. Returned value was: Hello world!
[2020-05-23 20:43:53,738] {taskinstance.py:1052} INFO - Marking task as SUCCESS.dag_id=hello_world, task_id=hello_task, execution_date=20200523T134217, start_date=20200523T134353, end_date=20200523T134353
[2020-05-23 20:44:03,372] {logging_mixin.py:112} INFO - [2020-05-23 20:44:03,372] {local_task_job.py:103} INFO - Task exited with return code 0
but when i tried to test run a single task using this command
airflow test dags/main.py hello_task 2020-05-23
it shows this error
airflow.exceptions.AirflowException: dag_id could not be found: dags/main.py. Either the dag did not exist or it failed to parse.
where i went wrong ?

You got your airflow test command a tad wrong, instead of giving the path to the dag, dags/main.py, you need to type in the dag_id itself which is hello_world looking at your code.
So try this:
airflow test hello_world hello_task 2020-05-23
You should get output similar to this :)
airflow#940836ce7da4:/opt/airflow$ airflow test hello_world hello_task 2020-05-23
[2020-05-23 14:18:51,144] {__init__.py:51} INFO - Using executor CeleryExecutor
[2020-05-23 14:18:51,145] {dagbag.py:396} INFO - Filling up the DagBag from /opt/airflow/dags
[2020-05-23 14:18:51,190] {taskinstance.py:669} INFO - Dependencies all met for <TaskInstance: hello_world.hello_task 2020-05-23T00:00:00+00:00 [None]>
[2020-05-23 14:18:51,203] {taskinstance.py:669} INFO - Dependencies all met for <TaskInstance: hello_world.hello_task 2020-05-23T00:00:00+00:00 [None]>
[2020-05-23 14:18:51,203] {taskinstance.py:879} INFO -
--------------------------------------------------------------------------------
[2020-05-23 14:18:51,203] {taskinstance.py:880} INFO - Starting attempt 1 of 1
[2020-05-23 14:18:51,203] {taskinstance.py:881} INFO -
--------------------------------------------------------------------------------
[2020-05-23 14:18:51,204] {taskinstance.py:900} INFO - Executing <Task(PythonOperator): hello_task> on 2020-05-23T00:00:00+00:00
[2020-05-23 14:18:51,234] {python_operator.py:114} INFO - Done. Returned value was: Hello world!
[2020-05-23 14:18:51,249] {taskinstance.py:1065} INFO - Marking task as SUCCESS.dag_id=hello_world, task_id=hello_task, execution_date=20200523T000000, start_date=20200523T141851, end_date=20200523T141851

After af 2.0;
airflow tasks test dag_id task_id date

Related

About ExternalTaskMarker & ExternalTaskSensor in Apache Airflow

Suppose there are two DAGs i.e., DAG1 and DAG2. should i need to schedule ExternalTaskMarker
in DAG1 and ExternalTaskSensor at same time? or how can i schedule it
There are two DAGs namely DAG1 & DAG2 . I define ExternalTaskMarker in DAG1 which is scheduled at 15 * * * * and defined ExternalTaskSensor in DAG2 which is scheduled at 0 * * * * . i see that ExternalTaskMarker is running fine in DAG1 and in DAG2 ExternalTaskSensor is failing because of belo error
[2022-11-05 15:32:12,091] {{taskinstance.py:901}} INFO - Executing <Task(ExternalTaskSensor): external_task_sensor-plume_dev_download_dimensions_hourly-pan.device_master_dimension> on 2022-11-03T20:00:00+00:00
[2022-11-05 15:32:12,095] {{standard_task_runner.py:54}} INFO - Started process 48476 to run task
[2022-11-05 15:32:12,123] {{standard_task_runner.py:77}} INFO - Running: ['airflow', 'run', 'sd_plume_dev_incoming_data_quality_hourly-pan', 'external_task_sensor-plume_dev_download_dimensions_hourly-pan.device_master_dimension', '2022-11-03T20:00:00+00:00', '--job_id', '56965917', '--pool', 'default_pool', '--raw', '-sd', 'DAGS_FOLDER/sd_plume_dev_incoming_data_quality_hourly-pan.py', '--cfg_path', '/tmp/tmpy1p4tag5']
[2022-11-05 15:32:12,123] {{standard_task_runner.py:78}} INFO - Job 56965917: Subtask external_task_sensor-plume_dev_download_dimensions_hourly-pan.device_master_dimension
[2022-11-05 15:32:12,192] {{logging_mixin.py:112}} INFO - Running %s on host %s <TaskInstance: sd_plume_dev_incoming_data_quality_hourly-pan.external_task_sensor-plume_dev_download_dimensions_hourly-pan.device_master_dimension 2022-11-03T20:00:00+00:00 [running]> itclab21
[2022-11-05 15:32:12,278] {{external_task_sensor.py:117}} INFO - Poking for plume_dev_download_dimensions_hourly-pan.device_master_dimension on 2022-11-03T20:00:00+00:00 ...
[2022-11-05 15:32:12,301] {{taskinstance.py:1141}} INFO - Rescheduling task, marking task as UP_FOR_RESCHEDULE
[2022-11-05 15:32:17,075] {{local_task_job.py:102}} INFO - Task exited with return code 0

Airflow 2.0.2 - Hourly DAG getting stuck seeing Refreshing TaskInstance repeatedly

I've been noticing that some of the DAG runs for an hourly DAG are being skipped, I checked the log for the DAG run before it started skipping and noticed it had actually been running for 7 hours which is why other DAG runs didn't happen, it is very strange since it usually only takes 30 min to finish running.
We're using Airflow version 2.0.2
This is what I saw in the logs:
2022-05-06 13:26:56,668] {taskinstance.py:595} DEBUG - Refreshing TaskInstance <TaskInstance: dfp_hourly.revequery 2022-05-05T13:00:00+00:00 [running]> from DB
[2022-05-06 13:26:56,806] {taskinstance.py:630} DEBUG - Refreshed TaskInstance <TaskInstance: dfp_hourly.revequery 2022-05-05T13:00:00+00:00 [running]>
[2022-05-06 13:27:01,860] {taskinstance.py:595} DEBUG - Refreshing TaskInstance <TaskInstance: dfp_hourly.revequery 2022-05-05T13:00:00+00:00 [running]> from DB
[2022-05-06 13:27:01,872] {taskinstance.py:630} DEBUG - Refreshed TaskInstance <TaskInstance: dfp_hourly.revequery 2022-05-05T13:00:00+00:00 [running]>
[2022-05-06 13:27:06,960] {taskinstance.py:595} DEBUG - Refreshing TaskInstance <TaskInstance: dfp_hourly.revequery 2022-05-05T13:00:00+00:00 [running]> from DB
[2022-05-06 13:27:07,019] {taskinstance.py:630} DEBUG - Refreshed TaskInstance <TaskInstance: dfp_hourly.revequery 2022-05-05T13:00:00+00:00 [running]>
[2022-05-06 13:27:12,224] {taskinstance.py:595} DEBUG - Refreshing TaskInstance <TaskInstance: dfp_hourly.revequery 2022-05-05T13:00:00+00:00 [running]> from DB
[2022-05-06 13:27:12,314] {taskinstance.py:630} DEBUG - Refreshed TaskInstance <TaskInstance: dfp_hourly.revequery 2022-05-05T13:00:00+00:00 [running]>
[2022-05-06 13:27:17,368] {taskinstance.py:595} DEBUG - Refreshing TaskInstance <TaskInstance: dfp_hourly.revequery 2022-05-05T13:00:00+00:00 [running]> from DB
[2022-05-06 13:27:17,377] {taskinstance.py:630} DEBUG - Refreshed TaskInstance
well, I think you are running too many task-parallel which causes them to run for hours, well this can be fixed by using Pool. Airflow pools can be used to limit the execution parallelism on arbitrary sets of tasks. The list of pools is managed in the UI (Menu -> Admin -> Pools) by giving the pools a name and assigning it several worker slots.
Tasks can then be associated with one of the existing pools by using the pool parameter when creating tasks:
aggregate_db_message_job = BashOperator(
task_id="aggregate_db_message_job",
execution_timeout=timedelta(hours=3),
pool="ep_data_pipeline_db_msg_agg",
bash_command=aggregate_db_message_job_cmd,
dag=dag,
)
aggregate_db_message_job.set_upstream(wait_for_empty_queue)
Tasks will be scheduled as usual while the slots fill up. The number of slots occupied by a task can be configured by pool_slots (see the section below). Once capacity is reached, runnable tasks get queued and their state will show as such in the UI. As slots free up, queued tasks start running based on the Priority Weights of the task and its descendants.
Note that if tasks are not given a pool, they are assigned to a default pool default_pool, which is initialized with 128 slots and can be modified through the UI or CLI (but cannot be removed).

Airflow: How to use the same tasks in different dags

I am learning Airflow and run into a problem.
I have 2 tasks, that I want to use in several dags. The difference between these tasks will be only the parameters the operators are going to get.
This could be accomplished by simply copy and pasting the tasks into all the dags, but maintain this type of code would be a nightmare.
So what a want to do is to create a class that will contain the tasks I will be calling several times and just import this class from the dags.
I replicated the issue with a minimal example.
This is the code for the class:
from airflow.operators.bash_operator import BashOperator
class Operator_generator():
_instance = None
def __init__(self, var1, var2):
self.var1 = var1
self.var2 = var2
def create_task_1(self):
return BashOperator(
task_id='task1',
bash_command='echo Im running task 1, the current execution date is {{ds}} and the previous execution date is {{prev_ds}}'
)
def create_task_2(self):
return BashOperator(
task_id='task2',
bash_command='echo Im running task 2, the current execution date is {{ds}} and the previous execution date is {{prev_ds}}'
)
and this is a dag example where I would import the class
from include.src.date.decorator import DefaultDateTime
from airflow import DAG
from include.src.airflow.xcom import cleanup
from operator_creator import Operator_generator
dag_id = "dag1"
default_args = {
"owner": "airflow",
"depends_on_past": False,
"start_date": DefaultDateTime(2021, 6, 1),
'retries': 1
}
# Dag definition
with DAG(
dag_id,
schedule_interval='#monthly',
catchup=False,
on_failure_callback=cleanup,
on_success_callback=cleanup
) as dag:
dag.doc_md = __doc__
operator_generator = Operator_generator('var1','var2')
task1 = operator_generator.create_task_1()
task2 = operator_generator.create_task_2()
task1 >> task2
Note that 'var1' and 'var2' are variables that I need to parametrize the operators.
The problem is that when I run the dag the tasks run twice:
[2021-08-25 16:29:46,937] {taskinstance.py:880} INFO - Starting attempt 1 of 2
[2021-08-25 16:29:46,937] {taskinstance.py:881} INFO -
--------------------------------------------------------------------------------
[2021-08-25 16:29:46,955] {taskinstance.py:900} INFO - Executing <Task(BashOperator): task1> on 2021-07-01T06:00:00+00:00
[2021-08-25 16:29:46,961] {standard_task_runner.py:53} INFO - Started process 67689 to run task
[2021-08-25 16:29:47,011] {logging_mixin.py:112} INFO - Running %s on host %s <TaskInstance: dag1.task1 2021-07-01T06:00:00+00:00 [running]> 30b770753547
[2021-08-25 16:29:47,032] {bash_operator.py:113} INFO - Tmp dir root location:
/tmp
[2021-08-25 16:29:47,033] {bash_operator.py:136} INFO - Temporary script location: /tmp/airflowtmpixijgd4s/task1lfcwdvfa
[2021-08-25 16:29:47,033] {bash_operator.py:146} INFO - Running command: echo Im running task 1, the current execution date is 2021-07-01 and the previous execution date is 2021-06-01
[2021-08-25 16:29:47,039] {bash_operator.py:153} INFO - Output:
[2021-08-25 16:29:47,040] {bash_operator.py:157} INFO - Im running task 1, the current execution date is 2021-07-01 and the previous execution date is 2021-06-01
[2021-08-25 16:29:47,040] {bash_operator.py:161} INFO - Command exited with return code 0
[2021-08-25 16:29:47,052] {taskinstance.py:1065} INFO - Marking task as SUCCESS.dag_id=dag1, task_id=task1, execution_date=20210701T060000, start_date=20210825T162946, end_date=20210825T162947
[2021-08-25 16:29:55,335] {taskinstance.py:669} INFO - Dependencies all met for <TaskInstance: dag1.task1 2021-08-25T16:29:41+00:00 [queued]>
[2021-08-25 16:29:55,335] {taskinstance.py:669} INFO - Dependencies all met for <TaskInstance: dag1.task2 2021-07-01T06:00:00+00:00 [queued]>
[2021-08-25 16:29:55,348] {taskinstance.py:669} INFO - Dependencies all met for <TaskInstance: dag1.task1 2021-08-25T16:29:41+00:00 [queued]>
[2021-08-25 16:29:55,348] {taskinstance.py:879} INFO -
--------------------------------------------------------------------------------
[2021-08-25 16:29:55,348] {taskinstance.py:880} INFO - Starting attempt 1 of 2
[2021-08-25 16:29:55,348] {taskinstance.py:881} INFO -
--------------------------------------------------------------------------------
[2021-08-25 16:29:55,357] {taskinstance.py:669} INFO - Dependencies all met for <TaskInstance: dag1.task2 2021-07-01T06:00:00+00:00 [queued]>
[2021-08-25 16:29:55,357] {taskinstance.py:879} INFO -
--------------------------------------------------------------------------------
[2021-08-25 16:29:55,357] {taskinstance.py:880} INFO - Starting attempt 1 of 2
[2021-08-25 16:29:55,357] {taskinstance.py:881} INFO -
--------------------------------------------------------------------------------
[2021-08-25 16:29:55,363] {taskinstance.py:900} INFO - Executing <Task(BashOperator): task1> on 2021-08-25T16:29:41+00:00
[2021-08-25 16:29:55,366] {standard_task_runner.py:53} INFO - Started process 67809 to run task
[2021-08-25 16:29:55,370] {taskinstance.py:900} INFO - Executing <Task(BashOperator): task2> on 2021-07-01T06:00:00+00:00
[2021-08-25 16:29:55,374] {standard_task_runner.py:53} INFO - Started process 67810 to run task
[2021-08-25 16:29:55,412] {logging_mixin.py:112} INFO - Running %s on host %s <TaskInstance: dag1.task1 2021-08-25T16:29:41+00:00 [running]> 30b770753547
[2021-08-25 16:29:55,422] {logging_mixin.py:112} INFO - Running %s on host %s <TaskInstance: dag1.task2 2021-07-01T06:00:00+00:00 [running]> 30b770753547
[2021-08-25 16:29:55,430] {bash_operator.py:113} INFO - Tmp dir root location:
/tmp
[2021-08-25 16:29:55,432] {bash_operator.py:136} INFO - Temporary script location: /tmp/airflowtmpsacovlfm/task1doc6fakb
[2021-08-25 16:29:55,432] {bash_operator.py:146} INFO - Running command: echo Im running task 1, the current execution date is 2021-08-25 and the previous execution date is 2021-08-25
[2021-08-25 16:29:55,440] {bash_operator.py:153} INFO - Output:
[2021-08-25 16:29:55,440] {bash_operator.py:157} INFO - Im running task 1, the current execution date is 2021-08-25 and the previous execution date is 2021-08-25
[2021-08-25 16:29:55,441] {bash_operator.py:161} INFO - Command exited with return code 0
[2021-08-25 16:29:55,444] {bash_operator.py:113} INFO - Tmp dir root location:
/tmp
[2021-08-25 16:29:55,445] {bash_operator.py:136} INFO - Temporary script location: /tmp/airflowtmpyqqww8an/task2i29a2lk7
[2021-08-25 16:29:55,445] {bash_operator.py:146} INFO - Running command: echo Im running task 2, the current execution date is 2021-07-01 and the previous execution date is 2021-06-01
[2021-08-25 16:29:55,451] {taskinstance.py:1065} INFO - Marking task as SUCCESS.dag_id=dag1, task_id=task1, execution_date=20210825T162941, start_date=20210825T162955, end_date=20210825T162955
[2021-08-25 16:29:55,453] {bash_operator.py:153} INFO - Output:
[2021-08-25 16:29:55,453] {bash_operator.py:157} INFO - Im running task 2, the current execution date is 2021-07-01 and the previous execution date is 2021-06-01
[2021-08-25 16:29:55,454] {bash_operator.py:161} INFO - Command exited with return code 0
[2021-08-25 16:29:55,465] {taskinstance.py:1065} INFO - Marking task as SUCCESS.dag_id=dag1, task_id=task2, execution_date=20210701T060000, start_date=20210825T162955, end_date=20210825T162955
[2021-08-25 16:29:56,922] {logging_mixin.py:112} INFO - [2021-08-25 16:29:56,921] {local_task_job.py:103} INFO - Task exited with return code 0
[2021-08-25 16:30:05,333] {logging_mixin.py:112} INFO - [2021-08-25 16:30:05,333] {local_task_job.py:103} INFO - Task exited with return code 0
[2021-08-25 16:30:05,337] {logging_mixin.py:112} INFO - [2021-08-25 16:30:05,337] {local_task_job.py:103} INFO - Task exited with return code 0
[2021-08-25 16:30:06,794] {taskinstance.py:669} INFO - Dependencies all met for <TaskInstance: dag1.task2 2021-08-25T16:29:41+00:00 [queued]>
[2021-08-25 16:30:06,809] {taskinstance.py:669} INFO - Dependencies all met for <TaskInstance: dag1.task2 2021-08-25T16:29:41+00:00 [queued]>
[2021-08-25 16:30:06,809] {taskinstance.py:879} INFO -
--------------------------------------------------------------------------------
[2021-08-25 16:30:06,810] {taskinstance.py:880} INFO - Starting attempt 1 of 2
[2021-08-25 16:30:06,810] {taskinstance.py:881} INFO -
--------------------------------------------------------------------------------
[2021-08-25 16:30:06,822] {taskinstance.py:900} INFO - Executing <Task(BashOperator): task2> on 2021-08-25T16:29:41+00:00
[2021-08-25 16:30:06,826] {standard_task_runner.py:53} INFO - Started process 67937 to run task
[2021-08-25 16:30:06,875] {logging_mixin.py:112} INFO - Running %s on host %s <TaskInstance: dag1.task2 2021-08-25T16:29:41+00:00 [running]> 30b770753547
[2021-08-25 16:30:06,892] {bash_operator.py:113} INFO - Tmp dir root location:
/tmp
[2021-08-25 16:30:06,893] {bash_operator.py:136} INFO - Temporary script location: /tmp/airflowtmpot_xsukw/task2xo4uxspu
[2021-08-25 16:30:06,893] {bash_operator.py:146} INFO - Running command: echo Im running task 2, the current execution date is 2021-08-25 and the previous execution date is 2021-08-25
[2021-08-25 16:30:06,901] {bash_operator.py:153} INFO - Output:
[2021-08-25 16:30:06,902] {bash_operator.py:157} INFO - Im running task 2, the current execution date is 2021-08-25 and the previous execution date is 2021-08-25
[2021-08-25 16:30:06,902] {bash_operator.py:161} INFO - Command exited with return code 0
[2021-08-25 16:30:06,913] {taskinstance.py:1065} INFO - Marking task as SUCCESS.dag_id=dag1, task_id=task2, execution_date=20210825T162941, start_date=20210825T163006, end_date=20210825T163006
[2021-08-25 16:30:16,800] {logging_mixin.py:112} INFO - [2021-08-25 16:30:16,799] {local_task_job.py:103} INFO - Task exited with return code 0
Notice how the tasks are executed 2 times:
In the first execution the values of {{ds}} and {{prev_ds}} are the current date.
In the second execution the values of {{ds}} and {{prev_ds}} correspond to the monthly interval.
Why the tasks run 2 times?
Is there a way to import tasks like this?
Note 1: I am not allowed to use subdags.
Edit: Adding the execution tree
Edit 2:
If anyone run into this problem I figured out.
The problem was that I was running the dag with an external trigger, that deletes the dag and start it again. So the the dag runs for the external trigger, but also the scheduler sees that the dag hasn't run for the month, so it schedules the execution, resulting in 2 runs.
The solution I found is:
Turn off the dag in the airflow interface
Delete the dag (with the red x in the far right of the dag)
Refresh the page
The dag appears again in the list, turn it on
This will make the scheduler do its job and the dag will run as it should.

Airlfow Execution Timeout not working well

I've set 'execution_timeout': timedelta(seconds=300) parameter on many tasks. When the execution timeout is set on task downloading data from Google Analytics it works properly - after ~300 seconds is the task set to failed. The task downloads some data from API (python), then it does some transformations (python) and loads data into PostgreSQL.
Then I've a task which executes only one PostgreSQL function - execution sometimes takes more than 300 seconds but I get this (task is marked as finished successfully).
*** Reading local file: /home/airflow/airflow/logs/bulk_replication_p2p_realtime/t1/2020-07-20T00:05:00+00:00/1.log
[2020-07-20 05:05:35,040] {__init__.py:1139} INFO - Dependencies all met for <TaskInstance: bulk_replication_p2p_realtime.t1 2020-07-20T00:05:00+00:00 [queued]>
[2020-07-20 05:05:35,051] {__init__.py:1139} INFO - Dependencies all met for <TaskInstance: bulk_replication_p2p_realtime.t1 2020-07-20T00:05:00+00:00 [queued]>
[2020-07-20 05:05:35,051] {__init__.py:1353} INFO -
--------------------------------------------------------------------------------
[2020-07-20 05:05:35,051] {__init__.py:1354} INFO - Starting attempt 1 of 1
[2020-07-20 05:05:35,051] {__init__.py:1355} INFO -
--------------------------------------------------------------------------------
[2020-07-20 05:05:35,098] {__init__.py:1374} INFO - Executing <Task(PostgresOperator): t1> on 2020-07-20T00:05:00+00:00
[2020-07-20 05:05:35,099] {base_task_runner.py:119} INFO - Running: ['airflow', 'run', 'bulk_replication_p2p_realtime', 't1', '2020-07-20T00:05:00+00:00', '--job_id', '958216', '--raw', '-sd', 'DAGS_FOLDER/bulk_replication_p2p_realtime.py', '--cfg_path', '/tmp/tmph11tn6fe']
[2020-07-20 05:05:37,348] {base_task_runner.py:101} INFO - Job 958216: Subtask t1 [2020-07-20 05:05:37,347] {settings.py:182} INFO - settings.configure_orm(): Using pool settings. pool_size=10, pool_recycle=1800, pid=26244
[2020-07-20 05:05:39,503] {base_task_runner.py:101} INFO - Job 958216: Subtask t1 [2020-07-20 05:05:39,501] {__init__.py:51} INFO - Using executor LocalExecutor
[2020-07-20 05:05:39,857] {base_task_runner.py:101} INFO - Job 958216: Subtask t1 [2020-07-20 05:05:39,856] {__init__.py:305} INFO - Filling up the DagBag from /home/airflow/airflow/dags/bulk_replication_p2p_realtime.py
[2020-07-20 05:05:39,894] {base_task_runner.py:101} INFO - Job 958216: Subtask t1 [2020-07-20 05:05:39,894] {cli.py:517} INFO - Running <TaskInstance: bulk_replication_p2p_realtime.t1 2020-07-20T00:05:00+00:00 [running]> on host dwh2-airflow-dev
[2020-07-20 05:05:39,938] {postgres_operator.py:62} INFO - Executing: CALL dw_system.bulk_replicate(p_graph_name=>'replication_p2p_realtime',p_group_size=>4 , p_group=>1, p_dag_id=>'bulk_replication_p2p_realtime', p_task_id=>'t1')
[2020-07-20 05:05:39,960] {logging_mixin.py:95} INFO - [2020-07-20 05:05:39,953] {base_hook.py:83} INFO - Using connection to: id: postgres_warehouse. Host: XXX Port: 5432, Schema: XXXX Login: XXX Password: XXXXXXXX, extra: {}
[2020-07-20 05:05:39,973] {logging_mixin.py:95} INFO - [2020-07-20 05:05:39,972] {dbapi_hook.py:171} INFO - CALL dw_system.bulk_replicate(p_graph_name=>'replication_p2p_realtime',p_group_size=>4 , p_group=>1, p_dag_id=>'bulk_replication_p2p_realtime', p_task_id=>'t1')
[2020-07-20 05:23:21,450] {logging_mixin.py:95} INFO - [2020-07-20 05:23:21,449] {timeout.py:42} ERROR - Process timed out, PID: 26244
[2020-07-20 05:23:36,453] {logging_mixin.py:95} INFO - [2020-07-20 05:23:36,452] {jobs.py:2562} INFO - Task exited with return code 0
Does anyone know how to enforce execution timeout out for such long running functions? It seems that the execution timeout is evaluated once the PG function finish.
Airflow uses the signal module from the standard library to affect a timeout. In Airflow it's used to hook into these system signals and request that the calling process be notified in N seconds and, should the process still be inside the context (see the __enter__ and __exit__ methods on the class) it will raise an AirflowTaskTimeout exception.
Unfortunately for this situation, there are certain classes of system operations that cannot be interrupted. This is actually called out in the signal documentation:
A long-running calculation implemented purely in C (such as regular expression matching on a large body of text) may run uninterrupted for an arbitrary amount of time, regardless of any signals received. The Python signal handlers will be called when the calculation finishes.
To which we say "But I'm not doing a long-running calculation in C!" -- yeah for Airflow this is almost always due to uninterruptable I/O operations.
The highlighted sentence above (emphasis mine) nicely explains why the handler is still triggered even after the task is allowed to (frustratingly!) finish, well beyond your requested timeout.

DummyOperator marked upstream_failed yet all upstream tasks marked success

I have an Airflow pipeline that produces 12 staging tables from Google Cloud Storage files and then performs some downstream processing. I have a DummyOperator to collect these tasks before proceeding to the next stages.
I'm getting an error on the wait_stg_load operator saying it's in an upstream_failed state. However all of the upstream tasks are marked as success. The DAG itself is now marked as failed. If I clear the status on wait_stg_load, everything proceeds fine. Any ideas on what I'm doing wrong?
I am using Google Cloud Composer which is version Airflow v 1.9 on Python 3
with DAG('load_data',
default_args=default_args,
schedule_interval='0 9 * * *',
concurrency=3
) as dag:
t2 = DummyOperator(
task_id='wait_stg_load',
dag=dag
)
for t in tables:
t1 = GoogleCloudStorageToBigQueryOperator(
task_id='load_stg_{}'.format(t.replace('.','_')),
bucket='my-bucket',
source_objects=['data/{}.json'.format(t)],
destination_project_dataset_table='{}.stg_{}'.format(DATASET_NAME, t.replace('.','_')),
schema_object='data/schemas/{}.json'.format(t),
source_format='NEWLINE_DELIMITED_JSON',
write_disposition='WRITE_TRUNCATE',
dag=dag
)
t1 >> t2
Update 1
I believe this is a concurrency issue within Airflow. I noticed that the task does indeed fail at some point, but later runs anyway. It gets marked complete, yet the DummyOperator doesn't see that.
[2019-02-14 09:00:14,734] {cli.py:374} INFO - Running on host airflow-worker
[2019-02-14 09:00:16,686] {models.py:1196} INFO - Dependencies all met for <TaskInstance: dag.task 2019-02-13 09:00:00 [queued]>
[2019-02-14 09:00:16,694] {models.py:1189} INFO - Dependencies not met for <TaskInstance: dag.task 2019-02-13 09:00:00 [queued]>, dependency 'Task Instance Slots Available' FAILED: The maximum number of running tasks (3) for this task's DAG 'dag' has been reached.
[2019-02-14 09:00:16,694] {models.py:1389} WARNING -
-------------------------------------------------------------------------------
FIXME: Rescheduling due to concurrency limits reached at task runtime. Attempt 1 of 1. State set to NONE
-------------------------------------------------------------------------------
[2019-02-14 09:00:16,694] {models.py:1392} INFO - Queuing into pool None
[2019-02-14 09:00:26,619] {cli.py:374} INFO - Running on host airflow-worker
[2019-02-14 09:00:28,563] {models.py:1196} INFO - Dependencies all met for <TaskInstance: dag.task 2019-02-13 09:00:00 [failed]>
[2019-02-14 09:00:28,570] {models.py:1196} INFO - Dependencies all met for <TaskInstance: dag.task 2019-02-13 09:00:00 [failed]>
[2019-02-14 09:00:28,570] {models.py:1406} INFO -
-------------------------------------------------------------------------------
Starting attempt 1 of
-------------------------------------------------------------------------------
[2019-02-14 09:00:28,607] {models.py:1427} INFO - Executing <Task(GoogleCloudStorageToBigQueryOperator): task> on 2019-02-13 09:00:00

Resources