Airflow tasks set to `no_status` when catchup is True

I'm attempting to configure a series of Airflow tasks to backfill some data (catchup=True). Once the DAG is deployed and unpaused, the first job runs successfully, but all subsequent runs have their tasks set to no_status and they never start.
I've tried renaming the DAG, restarting the Airflow webserver and scheduler, and clearing out old logs, but I'm not making any progress here.
Thoughts?
DAG code:
default_args = {
    "owner": "me",
    "retries": 2,
    "retry_delay": timedelta(minutes=2),
    "sla": timedelta(hours=1),
    "start_date": "2021-01-01T00:00",
}

dag = DAG(
    catchup=True,
    dag_id="ingest_dag_testing_6",
    dagrun_timeout=timedelta(hours=1),
    default_args=default_args,
    max_active_runs=1,
    schedule_interval="30 * * * *",
)

DATA_SOURCE_TYPES = [
    {
        "target_name": "task_a",
        "children": [
            {
                "target_name": "subtask_a1",
            },
            {
                "target_name": "subtask_a2",
            },
        ],
    }
]

with dag:
    for dst in DATA_SOURCE_TYPES:
        sub_ingest_tasks = []
        ingest_task = PythonOperator(
            task_id=f"ingest_{dst.get('target_name')}",
            python_callable=run_ingestion,
            op_args=[logger, exe_date, dst],
        )
        if dst.get("children"):
            for sdst in dst.get("children"):
                sub_ingest_tasks.append(
                    PythonOperator(
                        task_id=f"ingest_{sdst.get('target_name')}",
                        python_callable=run_ingestion,
                        op_args=[logger, exe_date, sdst],
                    )
                )
        ingest_task >> sub_ingest_tasks

Your code executes just fine.
I created a runnable example from your code (as it lacks imports/callables):
from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import timedelta


def run_ingestion(**context):
    print("Hello World")


default_args = {
    "owner": "me",
    "retries": 2,
    "retry_delay": timedelta(minutes=2),
    "sla": timedelta(hours=1),
    "start_date": "2021-01-01T00:00",
}

dag = DAG(
    catchup=True,
    dag_id="ingest_dag_testing_6",
    dagrun_timeout=timedelta(hours=1),
    default_args=default_args,
    max_active_runs=1,
    schedule_interval="30 * * * *",
)

DATA_SOURCE_TYPES = [
    {
        "target_name": "task_a",
        "children": [
            {
                "target_name": "subtask_a1",
            },
            {
                "target_name": "subtask_a2",
            },
        ],
    }
]

with dag:
    for dst in DATA_SOURCE_TYPES:
        sub_ingest_tasks = []
        ingest_task = PythonOperator(
            task_id=f"ingest_{dst.get('target_name')}",
            python_callable=run_ingestion,
            # op_args=[logger, exe_date, dst],
        )
        if dst.get("children"):
            for sdst in dst.get("children"):
                sub_ingest_tasks.append(
                    PythonOperator(
                        task_id=f"ingest_{sdst.get('target_name')}",
                        python_callable=run_ingestion,
                        # op_args=[logger, exe_date, sdst],
                    )
                )
        ingest_task >> sub_ingest_tasks
You can see it's working fine.
If you are running an old Airflow version, changing the dag_id may fix the problem: there could be old traces of DB records tied to this dag_id that were never cleaned up properly. The scheduler was refactored significantly in later versions.
If that does not help, probably the only solution is to upgrade to the latest Airflow version, as this is likely a bug in older versions that was fixed along the way (the code you shared does not reproduce the problem you describe on the latest Airflow version).

Related

MWAA ECSOperator "No task found" but succeeds

I have an ECSOperator task in MWAA.
When I trigger the task, it succeeds immediately. However, the task should take time to complete, so I do not believe it is actually starting.
When I go to inspect the task run, I get the error "No tasks found".
The task definition looks like this:
from datetime import datetime
from airflow import DAG
from airflow.providers.amazon.aws.operators.ecs import ECSOperator

dag = DAG(
    "my_dag",
    description="",
    start_date=datetime.fromisoformat("2022-03-28"),
    catchup=False,
)

my_task = ECSOperator(
    task_id="my_task",
    cluster="my-cluster",
    task_definition="my-task",
    launch_type="FARGATE",
    aws_conn_id="aws_ecs",
    overrides={},
    network_configuration={
        "awsvpcConfiguration": {
            "securityGroups": ["sg-aaaa"],
            "subnets": ["subnet-bbbb"],
        },
    },
    awslogs_group="/ecs/my-task",
)

my_task
What am I missing here?
If the task had executed, it would have a log.
I think your issue is that the task you defined is not assigned to any DAG object, which is why you see the "No tasks found" error (the DAG is empty).
You should add dag=dag:
my_task = ECSOperator(
    task_id="my_task",
    ...,
    dag=dag,
)
or use a context manager to avoid this issue:
with DAG(
    dag_id="my_dag",
    ...
) as dag:
    my_task = ECSOperator(
        task_id="my_task",
        ...,
    )
If you are using Airflow 2 you can also use the @dag decorator, as in the sketch below.
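For illustration, here is a minimal sketch of that decorator style, assuming Airflow 2.x and reusing the operator arguments from the question (the values are placeholders, and the non-essential ECS arguments are omitted):
from datetime import datetime
from airflow.decorators import dag
from airflow.providers.amazon.aws.operators.ecs import ECSOperator

@dag(start_date=datetime(2022, 3, 28), schedule_interval=None, catchup=False)
def my_dag():
    # Operators instantiated inside the decorated function are attached to the
    # DAG automatically, so no explicit dag= argument is needed.
    ECSOperator(
        task_id="my_task",
        cluster="my-cluster",
        task_definition="my-task",
        launch_type="FARGATE",
        aws_conn_id="aws_ecs",
        overrides={},
    )

# Calling the decorated function is what actually registers the DAG.
my_dag_instance = my_dag()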

DatabricksRunOperator Execution date

opr_run_now = DatabricksRunNowOperator(
    task_id='run_now',
    databricks_conn_id='databricks_default',
    job_id=754377,
    notebook_params=meta_data,
    dag=dag,
)
Is there a way to pass the execution date using the Databricks run operator?
What do you want to pass the execution_date to? What are you trying to achieve in the end? The following doc was helpful for me:
https://www.astronomer.io/guides/airflow-databricks
And here is an example where I am passing execution_date to be used in a python file run in Databricks. I'm capturing the execution_date using sys.argv.
from airflow import DAG
from airflow.providers.databricks.operators.databricks import (
    DatabricksRunNowOperator,
)
from datetime import datetime, timedelta

spark_python_task = {
    "python_file": "dbfs:/FileStore/sandbox/databricks_test_python_task.py"
}

# Define params for Run Now Operator
python_params = [
    "{{ execution_date }}",
    "{{ execution_date.subtract(hours=1) }}",
]

default_args = {
    "owner": "airflow",
    "depends_on_past": False,
    "email_on_failure": False,
    "email_on_retry": False,
    "retries": 1,
    "retry_delay": timedelta(minutes=2),
}

with DAG(
    dag_id="databricks_dag",
    start_date=datetime(2022, 3, 11),
    schedule_interval="@hourly",
    catchup=False,
    default_args=default_args,
    max_active_runs=1,
) as dag:
    opr_run_now = DatabricksRunNowOperator(
        task_id="run_now",
        databricks_conn_id="Databricks",
        job_id=2060,
        python_params=python_params,
    )
    opr_run_now
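For completeness, here is a minimal sketch of what the Databricks-side script (the python_file above) might look like; this is my own illustration of the sys.argv note, not code from the original answer:
import sys

if __name__ == "__main__":
    # python_params are passed to the job as positional arguments, so
    # sys.argv[1] is the rendered "{{ execution_date }}" and sys.argv[2]
    # is the timestamp one hour earlier.
    execution_date = sys.argv[1]
    previous_hour = sys.argv[2]
    print(f"Processing window {previous_hour} -> {execution_date}")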
There are two ways to configure the Databricks run operators. One is with named arguments (as you did), which doesn't support templating. The other is to pass the JSON payload that you would typically use to call api/2.0/jobs/run-now; this also lets you pass execution_date, because the json parameter is templated.
notebook_task_params = {
    'new_cluster': new_cluster,
    'notebook_task': {
        'notebook_path': '/test-{{ ds }}',
    },
}

DatabricksSubmitRunOperator(task_id='notebook_task', json=notebook_task_params)
For more information see the operator docs.
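As a minimal sketch of the second approach (my own illustration, reusing the job_id and connection id from the question; the notebook parameter names are hypothetical), the templated json payload could carry the execution date like this:
from airflow.providers.databricks.operators.databricks import DatabricksRunNowOperator

run_now_json = {
    "job_id": 754377,
    "notebook_params": {
        # json is a templated field, so these Jinja macros are rendered at runtime.
        "run_date": "{{ ds }}",                  # hypothetical notebook widget name
        "execution_ts": "{{ execution_date }}",  # hypothetical notebook widget name
    },
}

opr_run_now_templated = DatabricksRunNowOperator(
    task_id="run_now_templated",
    databricks_conn_id="databricks_default",
    json=run_now_json,
)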

Airflow tasks not running

I am trying to run a simple BashOperator task in Airflow. When the DAG is triggered manually, it lists the tasks in Tree and Graph view, but the tasks always stay in a "not started" state.
I have restarted my Airflow scheduler. I am running Airflow on localhost using a kubectl image with Docker Compose.
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.bash_operator import BashOperator
from datetime import datetime, timedelta

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'email': ['vijayraghunath21@gmail.com'],
    'email_on_success': True,
    'email_on_failure': True,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=2),
}

with DAG(
    dag_id='bash_demo',
    default_args=default_args,
    description='Bash Demo',
    start_date=datetime(2021, 1, 1),
    # schedule_interval='0 2 * * *',
    schedule_interval=None,
    max_active_runs=1,
    catchup=False,
    tags=['bash_demo'],
) as dag:
    dag.doc_md = __doc__

    # Task 1
    dummy_task = DummyOperator(task_id='dummy_task')

    # Task 2
    bash_task = BashOperator(
        task_id='bash_task',
        bash_command="echo 'command executed from BashOperator'",
    )

    dummy_task >> bash_task
DAG Image
As shown in the image you added, the DAG is toggled off, so it's not running. You should click the toggle button to switch it on.
This issue can be avoided in two ways:
Global solution - set dags_are_paused_at_creation = False in airflow.cfg. This will affect all DAGs in the system.
Local solution - use is_paused_upon_creation in the DAG constructor:
with DAG(
    dag_id='bash_demo',
    ...
    is_paused_upon_creation=False,
) as dag:
This parameter specifies whether the DAG is paused when it is created for the first time. If the DAG already exists, the parameter is ignored.

Airflow Debugging: How to skip backfill job execution when running DAG in vscode

I have setup airflow and am running a DAG using the following vscode debug configuration:
{
    "version": "0.2.0",
    "configurations": [
        {
            "name": "Python: Current File",
            "type": "python",
            "request": "launch",
            "program": "${file}",
            "console": "integratedTerminal",
            "justMyCode": false,
            "env": {
                "AIRFLOW__CORE__EXECUTOR": "DebugExecutor",
                "AIRFLOW__DEBUG__FAIL_FAST": "True",
                "LC_ALL": "en_US.UTF-8",
                "LANG": "en_US.UTF-8"
            }
        }
    ]
}
It runs the file, and my breakpoints in the DAG definitions break as expected. Then, at the end of the file, it executes dag.run(), I wait forever for the DAG to backfill, and my breakpoints inside the tasks' python_callable functions never break.
What airflow secret am I not seeing?
Here is my dag:
# scheduled to run every minute, poke for a new file every ten seconds
dag = DAG(
    dag_id='download-from-s3',
    start_date=days_ago(2),
    catchup=False,
    schedule_interval='*/1 * * * *',
    is_paused_upon_creation=False
)

def new_file_detection(**context):
    print("File found...")  # a breakpoint here never lands
    pprint(context)

init = BashOperator(
    task_id='init',
    bash_command='echo "My DAG initiated at $(date)"',
    dag=dag,
)

file_sensor = S3KeySensor(
    task_id='file_sensor',
    poke_interval=10,  # every 10 seconds
    timeout=60,
    bucket_key="s3://inbox/new/*",
    bucket_name=None,
    wildcard_match=True,
    soft_fail=True,
    dag=dag
)

file_found_message = PythonOperator(
    task_id='file_found_message',
    provide_context=True,
    python_callable=new_file_detection,
    dag=dag
)

init >> file_sensor >> file_found_message

if __name__ == '__main__':
    dag.clear(reset_dag_runs=True)
    dag.run()  # this triggers a backfill job
This is working for me as expected. I can set breakpoints at the DAG level or inside the python callable definitions and step through them using the VSCode debugger.
I'm using the same debug settings that you provided, but I changed the parameter reset_dag_runs=True to dag_run_state=State.NONE in the dag.clear() call, as specified on the DebugExecutor docs page. I believe this changed in one of the latest releases.
Regarding backfills, I'm setting catchup=False in the DAG arguments (it works both ways). Important note: I'm running Airflow version 2.0.0.
Here is an example using the same code as example_xcom.py, which comes with the default installation:
Debug settings:
{
    "version": "0.2.0",
    "configurations": [
        {
            "name": "Python: Current File",
            "type": "python",
            "request": "launch",
            "program": "${file}",
            "console": "internalConsole",
            "justMyCode": false,
            "env": {
                "AIRFLOW__CORE__EXECUTOR": "DebugExecutor",
                "AIRFLOW__DEBUG__FAIL_FAST": "True",
            }
        }
    ]
}
Example DAG:
import logging

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.utils.dates import days_ago

dag = DAG(
    'excom_xample',
    schedule_interval="@once",
    start_date=days_ago(2),
    default_args={'owner': 'airflow'},
    tags=['example'],
    catchup=False
)

value_1 = [1, 2, 3]
value_2 = {'a': 'b'}

def push(**kwargs):
    """Pushes an XCom without a specific target"""
    logging.info("log before PUSH")  # <<<<<<<<<<< Before landing on breakpoint
    kwargs['ti'].xcom_push(key='value from pusher 1', value=value_1)

def push_by_returning(**kwargs):
    """Pushes an XCom without a specific target, just by returning it"""
    return value_2

def puller(**kwargs):
    """Pull all previously pushed XComs and
    check if the pushed values match the pulled values."""
    ti = kwargs['ti']
    # get value_1
    pulled_value_1 = ti.xcom_pull(key=None, task_ids='push')
    print("PRINT Line after breakpoint ")  # <<<< After landing on breakpoint
    if pulled_value_1 != value_1:
        raise ValueError("The two values differ "
                         f"{pulled_value_1} and {value_1}")
    # get value_2
    pulled_value_2 = ti.xcom_pull(task_ids='push_by_returning')
    if pulled_value_2 != value_2:
        raise ValueError(
            f'The two values differ {pulled_value_2} and {value_2}')
    # get both value_1 and value_2
    pulled_value_1, pulled_value_2 = ti.xcom_pull(
        key=None, task_ids=['push', 'push_by_returning'])
    if pulled_value_1 != value_1:
        raise ValueError(
            f'The two values differ {pulled_value_1} and {value_1}')
    if pulled_value_2 != value_2:
        raise ValueError(
            f'The two values differ {pulled_value_2} and {value_2}')

push1 = PythonOperator(
    task_id='push',
    dag=dag,
    python_callable=push,
)

push2 = PythonOperator(
    task_id='push_by_returning',
    dag=dag,
    python_callable=push_by_returning,
)

pull = PythonOperator(
    task_id='puller',
    dag=dag,
    python_callable=puller,
)

pull << [push1, push2]

if __name__ == '__main__':
    from airflow.utils.state import State

    dag.clear(dag_run_state=State.NONE)
    dag.run()

Airflow does not run task

I've written a python task for Airflow. When I load my DAG into Airflow, it shows everything just fine. Once I trigger a run for my DAG, it will create a run and switch to succeeded without ever running my task. The webserver and scheduler are both running and there's nothing in the log.
The task doesn't even get a status while running (not even skipped).
If I run my task directly using airflow test update_dags work 2019-01-01 it runs just fine.
This is my DAG:
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from datetime import datetime, timedelta
import os

default_args = {
    'owner': 'Airflow',
    'depends_on_past': False,
    'start_date': datetime.today(),
    'email': ['***redacted***'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
    'params': {
        'git_repository': '***redacted***',
        'git_ref': 'origin/master',
        'git_folder': '/opt/dag-repository'
    }
}

def command(cmd: str, *args):
    templated_command = cmd.format(*args)
    print('Running command: {}'.format(templated_command))
    os.system(templated_command)

def do_work(**kwargs):
    params = kwargs['params']
    dag_directory = kwargs['conf'].get('core', 'dags_folder')
    git_repository = params['git_repository']
    git_ref = params['git_ref']
    git_folder = params['git_folder']
    command('if [ ! -d {0} ]; then git clone {1} {0}; fi', git_folder, git_repository)
    command('cd {0}; git fetch -apt', git_folder)
    command('cd {0}; git reset --hard {1}', git_folder, git_ref)
    command('ln -sf {0} {1}', '{}/src'.format(git_folder), dag_directory)

with DAG('update_dags', default_args=default_args, schedule_interval=timedelta(minutes=10), max_active_runs=1) as dag:
    work_stage = PythonOperator(
        task_id="work",
        python_callable=do_work,
        provide_context=True
    )
I also had the same problem. I solved it by turning os.system(cmd) into subprocess.run(cmd, shell=True, check=True), though I'm not quite sure why this works. Hope this helps.
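For reference, a minimal sketch of the command helper with that change applied, assuming the rest of the DAG stays the same:
import subprocess

def command(cmd: str, *args):
    templated_command = cmd.format(*args)
    print('Running command: {}'.format(templated_command))
    # check=True raises CalledProcessError on a non-zero exit code, so a
    # failing shell command fails the Airflow task instead of being
    # silently ignored.
    subprocess.run(templated_command, shell=True, check=True)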
