Airflow: DAG marked successful, but task was not scheduled

I have a slightly complex setup:
I run my Airflow (v1.10.13) pipelines in local time (configured via the VM's timezone). The following DAG's Monday run was marked successful, but the task within was never scheduled (and thus produced no logs whatsoever). I have had issues with the Airflow scheduler and non-UTC timezones in the past, so I wonder whether that could be the reason here.
from airflow import DAG
from datetime import timedelta
from somewhere import get_localized_yesterday
import prepered_tasks as t

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': get_localized_yesterday(),
    'email': [],
    'email_on_failure': True,
    'email_on_retry': False,
    'retry_delay': timedelta(minutes=1)
}

# Schedule the DAG at 2 a.m. on weekdays
dag = DAG(
    'descriptive_DAG_name',
    default_args=default_args,
    description='',
    schedule_interval='0 2 * * Mon-Fri',
    tags=['PROD']
)

single_task = t.task_partial(dag=dag)

single_task
The 'task_partial' is a task object wrapped in a partial, so I only need to provide the dag at instantiation. The same pattern works as intended in my other pipelines.
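For reference, a minimal sketch of how such a partial can be built with functools.partial - the operator and its arguments below are placeholders, since the real task lives in prepered_tasks:

from functools import partial
from airflow.operators.bash_operator import BashOperator  # Airflow 1.10.x import path

# Everything except the dag is bound up front; the DAG file only supplies dag=dag.
task_partial = partial(
    BashOperator,
    task_id='single_task',
    bash_command='echo "run the daily job"',
)

Calling task_partial(dag=dag) then instantiates the operator and attaches it to the DAG.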
I checked the usual suspects:
The scheduler is running.
The workers are running.
The DAG is turned on in the UI.
Other pipelines behave normally.
No dependencies on previous runs.
Start date lies well in the past.

There was a bug in Airflow 1.10.13 and the release was yanked.
You should upgrade to 1.10.14.
See the issue and the fix.
Quote from issue:
After performing an upgrade to v1.10.13 we noticed that tasks in some of our DAGs were not be scheduled. After a bit of investigation we discovered that by commenting out 'depends_on_past': True the issue went away.


Where to find Airflow SequentialExecutor worker error logs

I'm new to Airflow. I'm following the official tutorial to set up my first DAG and task:
from airflow import DAG
from airflow.operators.bash import BashOperator
from datetime import datetime, timedelta

default_args = {
    'owner': 'admin',
    'retries': 3,
    'retry_delay': timedelta(minutes=1)
}

with DAG(
    dag_id="hello_world_dag",
    description="Hello world DAG",
    start_date=datetime(2023, 1, 16),
    schedule_interval='@daily',
    default_args=default_args
) as dag:
    task1 = BashOperator(
        task_id="hello_task",
        bash_command="echo hello world!"
    )

    task1
When I tried to run this manually, it always failed. I checked the web server logs and the scheduler logs; they don't have any obvious errors. I also checked the task run logs, and they're empty.
The setup is pretty simple: SequentialExecutor with SQLite. My question is: where can I see the worker logs, or any other place that might have a useful message logged?
OK, finally figured this out.
Firstly, let me correct my question - there's actually an error raised in the scheduler log saying that the "BashTaskRunner" cannot be loaded. So I searched Airflow's source code and found that it was renamed to StandardTaskRunner about 3 years ago (link).
That is the only occurrence of the word BashTaskRunner in the whole repo, so I'm curious how the AIRFLOW_HOME/airflow.cfg is generated, since it sets this as the default task_runner value.
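If anyone else hits the same error, pointing the task_runner setting at the renamed class should resolve it. A minimal airflow.cfg excerpt (assuming the standard [core] section layout):

[core]
# BashTaskRunner was renamed to StandardTaskRunner; newer versions expect the new name here.
task_runner = StandardTaskRunner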

DAGs triggered by TriggerDagRunOperator are stuck in queue

example_trigger_controller_dag.py
import pendulum
from airflow import DAG
from airflow.operators.trigger_dagrun import TriggerDagRunOperator

with DAG(
    dag_id="example_trigger_controller_dag",
    start_date=pendulum.datetime(2021, 1, 1, tz="UTC"),
    catchup=False,
    schedule="@once",
    tags=["example"],
) as dag:
    trigger = TriggerDagRunOperator(
        task_id="test_trigger_dagrun",
        trigger_dag_id="example_trigger_target_dag",  # Ensure this equals the dag_id of the DAG to trigger
        conf={"message": "Hello World"},
    )
example_trigger_target_dag.py
import pendulum
from airflow import DAG
from airflow.decorators import task
from airflow.operators.bash import BashOperator


@task(task_id="run_this")
def run_this_func(dag_run=None):
    """
    Print the payload "message" passed to the DagRun conf attribute.
    :param dag_run: The DagRun object
    """
    print(f"Remotely received value of {dag_run.conf.get('message')} for key=message")


with DAG(
    dag_id="example_trigger_target_dag",
    start_date=pendulum.datetime(2021, 1, 1, tz="UTC"),
    catchup=False,
    schedule=None,
    tags=["example"],
) as dag:
    run_this = run_this_func()

    bash_task = BashOperator(
        task_id="bash_task",
        bash_command='echo "Here is the message: $message"',
        env={"message": '{{ dag_run.conf.get("message") }}'},
    )
The task in the controller DAG ended successfully, but the task in the target DAG is stuck in the queue. Any ideas on how to solve this problem?
I ran your DAGs (with both of them unpaused) and they work fine in a completely new environment (Airflow 2.5.0, Astro CLI Runtime 7.1.0). So the issue is most likely not with your DAG code.
Tasks stuck in the queue are often a scheduler issue, mostly with older Airflow versions. I suggest you:
make sure both DAGs are unpaused when the first DAG runs (see the sketch after this list).
make sure all start_dates are in the past (though in this case the tasks usually don't even get queued).
restart your scheduler/Airflow environment.
try running the DAGs while no other DAGs are running, to check whether the parallelism limit has been reached (if you are using the Kubernetes executor, also check worker_pods_creation_batch_size; with the Celery executor, check worker_concurrency and stalled_task_timeout).
take a look at your scheduler logs (at $AIRFLOW_HOME/logs/scheduler).
upgrade Airflow if you are running an older version.
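A minimal sketch of the unpausing point (assuming Airflow 2.x): wait_for_completion on the TriggerDagRunOperator keeps the controller task from being marked successful while the target run is still queued, so the problem surfaces in the controller instead of silently passing:

import pendulum
from airflow import DAG
from airflow.operators.trigger_dagrun import TriggerDagRunOperator

with DAG(
    dag_id="example_trigger_controller_dag",
    start_date=pendulum.datetime(2021, 1, 1, tz="UTC"),
    catchup=False,
    schedule="@once",
) as dag:
    TriggerDagRunOperator(
        task_id="test_trigger_dagrun",
        trigger_dag_id="example_trigger_target_dag",
        conf={"message": "Hello World"},
        wait_for_completion=True,  # keep the controller task running until the target run finishes
        poke_interval=30,          # seconds between checks of the target run's state
    )

The target DAG can additionally pass is_paused_upon_creation=False to its DAG(...) constructor so it is not created in the paused state when first parsed.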

Airflow DAG tasks not running when I run the DAG, despite tasks working fine when I test them

I have the following DAG defined in code:
from datetime import timedelta, datetime
import airflow
from airflow import DAG
from airflow.operators.docker_operator import DockerOperator
from airflow.contrib.operators.ecs_operator import ECSOperator
default_args = {
'owner': 'airflow',
'retries': 1,
'retry_delay': timedelta(minutes=5),
'start_date': datetime(2018, 9, 24, 10, 00, 00)
}
dag = DAG(
'data-push',
default_args=default_args,
schedule_interval='0 0 * * 1,4',
)
colors = ['blue', 'red', 'yellow']
for color in colors:
ECSOperator(dag=dag,
task_id='data-push-for-%s' % (color),
task_definition= 'generic-push-colors',
cluster= 'MY_ECS_CLUSTER_ARN',
launch_type= 'FARGATE',
overrides={
'containerOverrides': [
{
'name': 'push-colors-container',
'command': [color]
}
]
},
region_name='us-east-1',
network_configuration={
'awsvpcConfiguration': {
'securityGroups': ['MY_SG'],
'subnets': ['MY_SUBNET'],
'assignPublicIp': "ENABLED"
}
},
)
This should create a DAG with 3 tasks, one for each color in my colors list.
This seems good; when I run:
airflow list_dags
I see my DAG listed:
data-push
And when I run:
airflow list_tasks data-push
I see my three tasks appear as they should:
data-push-for-blue
data-push-for-red
data-push-for-yellow
I then test-run one of my tasks by entering the following into the terminal:
airflow run data-push data-push-for-blue 2017-1-23
And this runs the task, which I can see appear in my ECS cluster on the AWS dashboard, so I know for a fact that the task runs on my ECS cluster, the data is pushed successfully, and everything is great.
It's when I try to run the DAG data-push from the Airflow UI that I run into a problem.
I run:
airflow initdb
followed by:
airflow webserver
and then go to the Airflow UI at localhost:8080.
I see the DAG data-push in the list of DAGs and click it; then, to test-run the entire DAG, I click the "Trigger DAG" button. I don't add any configuration JSON and click 'Trigger'. The tree view for the DAG then shows a green circle on the right of the tree structure, seemingly indicating the DAG is 'running'. But the green circle just stays there for ages, and when I manually check my ECS dashboard I see no tasks actually running, so nothing happens after triggering the DAG from the Airflow UI, despite the tasks working when I run them manually from the CLI.
I am using the SequentialExecutor if that matters.
My two main theories as to why triggering the DAG does nothing, while running the individual tasks from the CLI works, are: (1) maybe I am missing something in the Python code where I define the DAG (perhaps because I don't specify any dependencies between the tasks?), or (2) I am not running the Airflow scheduler. But if I am manually triggering the DAGs from the Airflow UI, I don't see why the scheduler would need to be running, or why the UI wouldn't show me an error saying this is a problem.
Any ideas?
Sounds like you did not unpause your DAG:
Toggle the On/Off switch in the upper left of the Web UI, or use the CLI: airflow unpause <dag_id>.

How to fix WARNING - schedule_interval is used for <taskname> though it has been deprecated as a task parameter

I keep getting the warning:
WARNING - schedule_interval is used for <Task(BigQueryOperator): mytask>, though it has been deprecated as a task parameter, you need to specify it as a DAG parameter instead
I am on Google Cloud Composer 1.9.0.
schedule_interval = datetime.timedelta(days=1)

default_args = {
    'owner': 'e',
    'catchup': False,
    'start_date': datetime.datetime(2019, 10, 25),
}

with models.DAG(
    dag_id="mydag",
    schedule_interval=schedule_interval,
    default_args=default_args,
) as dag:
    mytask
How do I address this warning? I thought that by explicitly specifying the schedule_interval I would avoid this.
I faced the same issue while migrating to the (then latest) Airflow version 1.10.10.
The message just says that you defined the schedule_interval not only at the DAG level (as in your code snippet) but also at the task level - in your example, on the BigQueryOperator task. Defining schedule_interval for a task is deprecated, just like e.g. retry_delay_in_hours.
To prevent this warning, you just have to remove this attribute from all of your tasks.
The schedule interval from the DAG is automatically applied to the task(s). See the docs for more information.
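For illustration, a minimal sketch of the fix, with a placeholder query and the Airflow 1.10 / Composer 1.x import path - the schedule lives only on the DAG, and nothing schedule-related is passed to the operator:

import datetime
from airflow import models
from airflow.contrib.operators.bigquery_operator import BigQueryOperator

with models.DAG(
    dag_id="mydag",
    schedule_interval=datetime.timedelta(days=1),  # schedule belongs on the DAG
    default_args={"start_date": datetime.datetime(2019, 10, 25)},
) as dag:
    mytask = BigQueryOperator(
        task_id="mytask",
        sql="SELECT 1",      # placeholder query
        use_legacy_sql=False,
        # no schedule_interval here - passing it to the operator triggers the warning
    )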
Cheers
Michael

How to stop the Airflow scheduler from backfilling a DAG? catchup_by_default=False and catchup=False do not seem to work

The setting catchup_by_default=False in airflow.cfg does not seem to work. Adding catchup=False to the DAG doesn't work either.
Here's how to reproduce the issue. I always start from a clean slate by running airflow resetdb. As soon as I unpause the DAG, the tasks start to backfill.
Here's the setup for the DAG; I'm just using the tutorial example.
default_args = {
    "owner": "airflow",
    "depends_on_past": False,
    "start_date": datetime(2018, 9, 16),
    "email": ["airflow@airflow.com"],
    "email_on_failure": False,
    "email_on_retry": False,
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
}

dag = DAG("tutorial", default_args=default_args, schedule_interval=timedelta(1), catchup=False)
To be clear: if you enabled this DAG when the current time is 2018-10-22T09:00:00.000 EDT (which is 2018-10-22T13:00:00.000Z), it would be started some time after 2018-10-22T13:00:00.000Z with a run date of 2018-10-21T00:00:00.000Z.
This is not backfilling from the start date; rather, without any prior run, it does "catch up" the most recent completed valid period. I'm not sure why that's been the case in Airflow for a while, but it's documented that catchup=False means creating a single run for the very most recent valid period.
If the DagRun run date is further confusing to you, recall that run dates are the execution_date, which is the start of the interval period. The data for the interval is only completely available at the end of the interval period, but Airflow is designed to pass in the start of the period.
Then the next run would start sometime after 2018-10-23T00:00:00.000Z with an execution_date set as 2018-10-22T00:00:00.000Z.
If, on the 22nd or later, you're getting any run date earlier than the 21st, or multiple runs scheduled, then yes, catchup=False is not working. But there are no other reports of that being the case in v1.10 or the v1-10-stable branch.
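To make the timeline concrete, a minimal sketch of the DAG in question with the run dates described above (dates are illustrative):

from datetime import datetime, timedelta
from airflow import DAG

dag = DAG(
    "tutorial",
    start_date=datetime(2018, 9, 16),
    schedule_interval=timedelta(days=1),
    catchup=False,
)

# Unpaused on 2018-10-22 (UTC): the scheduler creates exactly one run with
# execution_date 2018-10-21T00:00:00Z (the start of the most recent completed
# interval), started shortly after the DAG is unpaused. The next run gets
# execution_date 2018-10-22T00:00:00Z and starts after 2018-10-23T00:00:00Z.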
Like @dlamblin mentioned, and as mentioned in the docs too, Airflow creates a single DagRun for the most recent valid interval: catchup=False instructs the scheduler to only create a DAG run for the most current instance of the DAG interval series.
There was, however, a bug when using a timedelta for schedule_interval instead of a CRON expression or CRON preset. This has been fixed in Airflow master with https://github.com/apache/airflow/pull/8776. We will release Airflow 1.10.11 with this fix.
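As a possible workaround on affected 1.10.x releases, the same daily schedule can be expressed as a CRON preset instead of a timedelta, for example:

from datetime import datetime
from airflow import DAG

default_args = {
    "owner": "airflow",
    "start_date": datetime(2018, 9, 16),
}

# "@daily" is equivalent to timedelta(1) here, but avoids the timedelta-specific catchup bug.
dag = DAG(
    "tutorial",
    default_args=default_args,
    schedule_interval="@daily",
    catchup=False,
)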
I know this thread is a little old, but setting catchup_by_default = False in airflow.cfg did stop Airflow from backfilling for me.
(My Airflow version is 1.10.12)
I resent that this config is not set to False by default. This, and the fact that the DAG starts one schedule_interval after the start_date, are the two most confusing things that stump Airflow beginners.
The first time I used Airflow, I wasted an entire afternoon trying to figure out why my test task, which was scheduled to run every 5 minutes, was running in quick succession (say, every 5-6 seconds). It took me a while to realize that it was backfill in action.
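For anyone else tripped up by those two points, a small sketch with illustrative values:

from datetime import datetime, timedelta
from airflow import DAG

dag = DAG(
    "five_minute_job",  # illustrative dag_id
    start_date=datetime(2018, 9, 16, 12, 0),
    schedule_interval=timedelta(minutes=5),
    catchup=False,  # without this, every missed 5-minute interval since start_date
                    # is scheduled in quick succession - the "backfill" described above
)

# First scheduled run: execution_date 2018-09-16T12:00, actually launched around 12:05,
# i.e. one schedule_interval after the start_date.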
