Backfilling, Trigger dag w/ config - airflow

I have a dag that has been running for months.
I would like to use the UI to trigger a backfill.
My dag already has catchup=True.
{"start_date": "2020-01-01", "end_date" "2022-01-01"}
But it does not backfill as it should. It only start from today
Note, my previous runs are successful if that matters?
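As far as I know, the "Trigger DAG w/ config" button only creates a single manual run, and start_date/end_date keys in the conf JSON have no built-in backfill meaning. A date-range backfill is usually kicked off from the CLI instead, e.g. in Airflow 2.x:
airflow dags backfill -s 2020-01-01 -e 2022-01-01 <dag_id>
(In Airflow 1.10 the equivalent command is airflow backfill.)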


Where to find Airflow SequentialExecutor worker error logs

I'm new to Airflow. I'm following the official tutorial to set up the first DAG and task:
from airflow import DAG
from airflow.operators.bash import BashOperator
from datetime import datetime, timedelta

default_args = {
    'owner': 'admin',
    'retries': 3,
    'retry_delay': timedelta(minutes=1)
}

with DAG(
    dag_id="hello_world_dag",
    description="Hello world DAG",
    start_date=datetime(2023, 1, 16),
    schedule_interval='@daily',
    default_args=default_args
) as dag:
    task1 = BashOperator(
        task_id="hello_task",
        bash_command="echo hello world!"
    )

    task1
When I tried to run this manually, it always failed. I've checked the web server logs and the scheduler logs; they don't have any obvious errors. I also checked the task run logs; they're empty.
The setup is pretty simple: SequentialExecutor with sqlite. My question is: where can I see the worker logs, or any other place that might have a useful message logged?
OK, finally figured this out.
First, let me correct my question: there is actually an error raised in the scheduler log saying that the "BashTaskRunner" cannot be loaded. So I searched Airflow's source code and found it was renamed to StandardTaskRunner about 3 years ago (link).
That is the only occurrence of the word BashTaskRunner in the whole repo, so I'm curious how the AIRFLOW_HOME/airflow.cfg was generated that sets this as the default task_runner value.
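For anyone hitting the same thing, a rough sketch of the airflow.cfg change that should resolve the error (assuming your Airflow version keeps the task_runner setting under [core]):

[core]
task_runner = StandardTaskRunner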

Not able to find my DAG in airflow WEB UI even though the dag is in correct folder

I have been trying for the past 2 days to resolve this. There is a DAG Python script which I created and saved in the dags folder in Airflow, the folder that is referred to in the "airflow.cfg" file. All the other dags are getting updated except for this one dag. I tried restarting the scheduler, and also tried resetting the Airflow db using airflow db reset and then running airflow db init once again, but the same issue still exists.
Some ideas on what you could check:
Do all of your DAGs have a unique dag_id? (I lost a few hours to this once; if two dags have the same name, the scheduler will randomly pick one to display with every dag_dir_list_interval.)
If you are using the @dag decorator: are you calling the DAG function below its definition? Like so:
from airflow.decorators import dag, task
from pendulum import datetime

@dag(
    dag_id="unique_name",
    start_date=datetime(2022, 12, 10),
    schedule=None,
    catchup=False
)
def my_dag():
    @task
    def say_hi():
        return "hi"

    say_hi()

# without this line the DAG will not show up in the UI
my_dag()
What is the output of airflow dags list and airflow dags list-import-errors? (See the short Python sketch below this list for a programmatic way to surface import errors.)
If you have a lot of DAGs in your environment you might want to increase the dagbag_import_timeout.
Does your DAG work if thrown into a new Airflow instance? (The easiest way to check is by spinning up a project with the Astro CLI and putting the dag into the dags folder created by astro dev init.)
Disclaimer: I work at Astronomer, which develops the Astro CLI as an open-source project.
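As promised above, a minimal sketch (assuming a standard Airflow 2.x install and that your dags folder is the one configured in airflow.cfg) to surface import errors and loaded dag_ids from a Python shell:

from airflow.models import DagBag

# Parse the configured dags folder the same way the scheduler does
dagbag = DagBag()

# Files that failed to import, mapped to their tracebacks
for path, error in dagbag.import_errors.items():
    print(path, error)

# DAG ids that were successfully loaded
print(sorted(dagbag.dags))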

Can this warning be avoided in Apache Airflow 2.0?

I am using Airflow v2.0 on Windows 10 WSL (Ubuntu 20.04).
The warning message is:
/home/jainri/.local/lib/python3.8/site-packages/airflow/models/dag.py:1342: PendingDeprecationWarning: The requested task could not be added to the DAG because a task with task_id create_tag_template_field_result is already in the DAG. Starting in Airflow 2.0, trying to overwrite a task will raise an exception.
warnings.warn(
Done.
Due to this warning, the dags showing in the web UI also include some example dags that ship with Apache Airflow. I have set up AIRFLOW_HOME and it also picks up dags from there, but the list of example dags is displayed as well. I have posted an image of the web UI too.
[screenshot: Web UI DAG list]
This is the dag that I am trying to run:
import datetime
import logging

from airflow import DAG
from airflow.operators.python_operator import PythonOperator

#
# TODO: Define a function for the python operator to call
#
def greet():
    logging.info("Hello Rishabh!!")

dag = DAG(
    'lesson1.demo1',
    start_date=datetime.datetime.now()
)

#
# TODO: Define the task below using PythonOperator
#
greet_task = PythonOperator(
    task_id='greet_task',
    python_callable=greet,
    dag=dag
)
Also, the main issue is that the list of dags showing in the web UI includes the example dags. That shows up as a huge list along with my own dags, which makes it cumbersome to look for my own dags.
I found the issue: the warning you are seeing comes from airflow/example_dags/example_complex.py (one of the example dags shipped with Airflow).
Disable loading of the example dags by setting AIRFLOW__CORE__LOAD_EXAMPLES=False as an environment variable, or set [core] load_examples = False in airflow.cfg (docs).
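For example (assuming a bash shell), you could export the variable before starting the webserver and scheduler; the setting is read when DAGs are parsed, so restart both processes after changing it:

export AIRFLOW__CORE__LOAD_EXAMPLES=False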

Airflow DAG tasks not running when I run the DAG, despite tasks working fine when I test them

I have the following DAG defined in code:
from datetime import timedelta, datetime

import airflow
from airflow import DAG
from airflow.operators.docker_operator import DockerOperator
from airflow.contrib.operators.ecs_operator import ECSOperator

default_args = {
    'owner': 'airflow',
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
    'start_date': datetime(2018, 9, 24, 10, 00, 00)
}

dag = DAG(
    'data-push',
    default_args=default_args,
    schedule_interval='0 0 * * 1,4',
)

colors = ['blue', 'red', 'yellow']

for color in colors:
    ECSOperator(dag=dag,
        task_id='data-push-for-%s' % (color),
        task_definition='generic-push-colors',
        cluster='MY_ECS_CLUSTER_ARN',
        launch_type='FARGATE',
        overrides={
            'containerOverrides': [
                {
                    'name': 'push-colors-container',
                    'command': [color]
                }
            ]
        },
        region_name='us-east-1',
        network_configuration={
            'awsvpcConfiguration': {
                'securityGroups': ['MY_SG'],
                'subnets': ['MY_SUBNET'],
                'assignPublicIp': "ENABLED"
            }
        },
    )
This should create a DAG with 3 tasks, one for each color in my colors list.
This seems good; when I run:
airflow list_dags
I see my dag listed:
data-push
And when I run:
airflow list_tasks data-push
I see my three tasks appear as they should:
data-push-for-blue
data-push-for-red
data-push-for-yellow
I then test-run one of my tasks by entering the following into the terminal:
airflow run data-push data-push-for-blue 2017-1-23
And this runs the task, which I can see appear in my ECS cluster on the AWS dashboard, so I know for a fact the task runs on my ECS cluster, the data is pushed successfully, and everything is great.
Now, trying to run the DAG data-push from the Airflow UI is where I run into a problem.
I run:
airflow initdb
followed by:
airflow webserver
and then go into the Airflow UI at localhost:8080.
I see the dag data-push in the list of dags, click it, and then to test-run the entire dag I click the "Trigger DAG" button. I don't add any configuration JSON and then click 'Trigger'. The tree view for the DAG then shows a green circle on the right of the tree structure, seemingly indicating the DAG is 'running'. But the green circle just stays there for ages, and when I manually check my ECS dashboard I see no tasks actually running. So nothing happens after triggering the DAG from the Airflow UI, despite the tasks working when I manually run them from the CLI.
I am using the SequentialExecutor, if that matters.
My two main theories as to why triggering the DAG does nothing, while running the individual tasks from the CLI works, are: (1) maybe I am missing something in my Python code where I define the dag (perhaps because I don't specify any dependencies for the tasks?), or (2) I am not running the Airflow scheduler. But if I am manually triggering the DAGs from the Airflow UI, I don't see why the scheduler would need to be running, or why the UI wouldn't show me an error saying this is a problem.
Any ideas?
Sounds like you did not unpause your dag:
Toggle the On/Off switch in the upper left of the web UI, or use the CLI: airflow unpause <dag_id>.

How to stop a DAG (and the Airflow scheduler) from backfilling? catchup_by_default=False and catchup=False do not seem to work

The setting catchup_by_default=False in airflow.cfg does not seem to work. Adding catchup=False to the DAG doesn't work either.
Here's how to reproduce the issue. I always start from a clean slate by running airflow resetdb. As soon as I unpause the dag, the tasks start to backfill.
Here's the setup for the dag. I'm just using the tutorial example.
default_args = {
    "owner": "airflow",
    "depends_on_past": False,
    "start_date": datetime(2018, 9, 16),
    "email": ["airflow@airflow.com"],
    "email_on_failure": False,
    "email_on_retry": False,
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
}

dag = DAG("tutorial", default_args=default_args, schedule_interval=timedelta(1), catchup=False)
To be clear: if you enabled this DAG that you specified when the time now is 2018-10-22T9:00:00.000EDT (which is, what, 2018-10-22T13:00:00.000Z), it would be started some time after 2018-10-22T13:00:00.000Z with a run date marked 2018-10-21T00:00:00.000Z.
This is not backfilling from the start date, but without any prior run it does "catch up" the most recent completed valid period; I'm not sure why that's been the case in Airflow for a while, but it's documented that catchup=False means creating a single run for the very most recent valid period.
If the dag run's run date is further confusing to you, please recall that run dates are the execution_date, which is the start of the interval period. The data for the interval is only completely available at the end of the interval period, but Airflow is designed to pass in the start of the period.
Then the next run would start sometime after 2018-10-23T00:00:00.000Z with an execution_date set to 2018-10-22T00:00:00.000Z.
If, on the 22nd or later, you're getting any run date earlier than the 21st, or multiple runs scheduled, then yes, catchup=False is not working. But there are no other reports of that being the case in the v1.10 or v1-10-stable branch.
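To illustrate the timing, here is a toy sketch (not Airflow internals; the daily schedule and dates are taken from the example above):

from datetime import datetime, timedelta

# Enabling the daily DAG at 2018-10-22T13:00Z with catchup=False:
now = datetime(2018, 10, 22, 13, 0)          # current UTC time
interval = timedelta(days=1)

# The most recent completed interval ends at today's midnight...
interval_end = datetime(now.year, now.month, now.day)
# ...and starts one interval earlier; that start is the run's execution_date.
execution_date = interval_end - interval

print(execution_date)  # 2018-10-21 00:00:00 -- the single run catchup=False creates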
Like @dlamblin mentioned, and as mentioned in the docs too, Airflow will create a single DagRun for the most recent valid interval. catchup=False instructs the scheduler to only create a DAG Run for the most current instance of the DAG interval series.
However, there was a bug when using a timedelta for schedule_interval instead of a CRON expression or CRON preset. This has been fixed in Airflow master with https://github.com/apache/airflow/pull/8776. We will release Airflow 1.10.11 with this fix.
I know this thread is a little old, but setting catchup_by_default = False in airflow.cfg did stop Airflow from backfilling for me.
(My Airflow version is 1.10.12)
I resent that this config is not set to False by default. This and the fact that the dag starts one schedule_interval after the start_date are the two most confusing things that stump Airflow beginners.
The first time I used Airflow, I wasted an entire afternoon trying to figure out why my test task, which was scheduled to run every 5 minutes, was running in quick succession (say, every 5-6 seconds). It took me a while to realize that it was backfill in action.
