Airflow - How do I ignore succeeded tasks in a backfill?

I have added new tasks to my DAG and need to backfill them. At the moment, when I run airflow backfill it runs all the tasks (new ones and old ones), and I would like to ignore the old tasks which have already succeeded.
Is there any way to skip the tasks in the success state in a backfill?

As of Airflow version 1.8.1, successful tasks should no longer be scheduled by a backfill; see AIRFLOW-1124.
Note that you can also specify which tasks you want to run in a backfill:
-t TASK_REGEX, --task_regex TASK_REGEX
    The regex to filter specific task_ids to backfill (optional)
The ignore dependencies flag may also help you in case your new tasks depend on any old ones that may not have succeeded.
-i, --ignore_dependencies
    Skip upstream tasks, run only the tasks matching the regexp. Only works in conjunction with task_regex
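Putting the two flags together, a backfill limited to the new tasks might look like this (the dag id, task regex, and date range below are placeholders):
airflow backfill -s 2017-06-01 -e 2017-06-07 -t "my_new_task.*" -i my_dag_id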

Related

Not able to find my DAG in airflow WEB UI even though the dag is in correct folder

I have been trying for the past 2 days to resolve this. There is a DAG Python script which I created and saved in the dags folder in Airflow, which is the folder referenced in the airflow.cfg file. The other DAGs are getting updated, except for this one DAG. I tried restarting the scheduler and also tried resetting the Airflow DB using airflow db reset and then running airflow db init once again, but the same issue still exists.
Some ideas on what you could check:
Do all of your DAGs have a unique dag_id? (I lost a few hours to this once; if two DAGs have the same name, the scheduler will randomly pick one to display with every dag_dir_list_interval.)
If you are using the @dag decorator: are you calling the DAG below its definition? Like so:
from airflow.decorators import dag, task
from pendulum import datetime

@dag(
    dag_id="unique_name",
    start_date=datetime(2022, 12, 10),
    schedule=None,
    catchup=False,
)
def my_dag():
    @task
    def say_hi():
        return "hi"

    say_hi()

# without this line the DAG will not show up in the UI
my_dag()
What is the output of airflow dags list and airflow dags list-import-errors?
If you have a lot of DAGs in your environment you might want to increase dagbag_import_timeout (see the snippet after this list).
Does your DAG work if thrown into a new Airflow instance? (The easiest way to check is by spinning up a project with the Astro CLI and putting the DAG into the dags folder created by astro dev init.)
Disclaimer: I work at Astronomer, which develops the Astro CLI as an open-source project.
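Regarding the dagbag_import_timeout suggestion above, one way to raise it is through the environment-variable override Airflow supports for airflow.cfg settings; the value here is only an example:
# Give DAG files more time to parse before the import is aborted (the default is 30 seconds).
# Equivalent to setting dagbag_import_timeout under [core] in airflow.cfg.
export AIRFLOW__CORE__DAGBAG_IMPORT_TIMEOUT=120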

Airflow DAGs with Tasks

I have around 30 tasks in one DAG. At times, I may want to run each task separately. Could anyone please let me know whether I can run the 30 tasks separately on an as-needed basis?
Also, it looks like I can either create one DAG with all 30 tasks or create a separate DAG for each task. Which one is better? When should I use one DAG with many tasks, and when one DAG with one task (ending up with many DAGs)?
Thanks in advance!
You can run a single airflow task using the CLI
From the docs:
airflow tasks run [-h] [--cfg-path CFG_PATH] [--error-file ERROR_FILE] [-f]
[-A] [-i] [-I] [-N] [-l] [-m] [-p PICKLE] [--pool POOL]
[--ship-dag] [-S SUBDIR]
dag_id task_id execution_date_or_run_id
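For example, to run one task for a specific logical date (the dag id, task id, and date below are placeholders):
airflow tasks run example_dag my_single_task 2021-06-01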
Choosing how to structure your DAGs and tasks will depend on the problem you are trying to solve.

Accessing DAG configuration variable in its constructor

In Airflow 2.0, when creating a DAG using the DAG constructor, I would like to use one of its trigger configuration parameters to name its dag_id.
For example, as I use the Google Cloud Composer environment, I have something like the following code snippets:
trigger_dag.sh
DAG_VERSION=some_dag_v1.0.0
TRIGGER_PARAMS='{"dag_version":"'"${DAG_VERSION}"'"}';
gcloud beta composer environments run "${AIRFLOW_ENV_NAME}" --location=us-central1 dags trigger -- "${DAG_VERSION}" --conf "${TRIGGER_PARAMS}";
dag.py
dag = DAG(
    dag_id=conf.dag_version,  ## <- How do I access DAG config variables here?
    schedule_interval=conf.dag_schedule_interval)
If I were inside a Python operator, I would probably have defined conf = context['dag_run'].conf, where **context is given as an argument. However, I'm not sure it's possible to do that when initially defining the DAG at the top level of dag.py.
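For reference, a minimal sketch of the in-operator pattern the question mentions, i.e. reading the trigger conf from the run-time context inside a task rather than at module level (the dag and task names here are only illustrative):
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def print_dag_version(**context):
    # --conf values are attached to the DagRun, so they only exist at run time,
    # not when the scheduler parses dag.py at the top level.
    conf = context["dag_run"].conf or {}
    print(conf.get("dag_version"))

with DAG(
    dag_id="some_dag_v1.0.0",  # static id; the dag_id itself cannot be read from --conf at parse time
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,
) as dag:
    PythonOperator(task_id="show_conf", python_callable=print_dag_version)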

How to stop a DAG from backfilling? catchup_by_default=False and catchup=False do not seem to stop the Airflow scheduler from backfilling

The setting catchup_by_default=False in airflow.cfg does not seem to work. Adding catchup=False to the DAG doesn't work either.
Here's how to reproduce the issue. I always start from a clean slate by running airflow resetdb. As soon as I unpause the DAG, the tasks start to backfill.
Here's the setup for the dag. I'm just using the tutorial example.
default_args = {
    "owner": "airflow",
    "depends_on_past": False,
    "start_date": datetime(2018, 9, 16),
    "email": ["airflow@airflow.com"],
    "email_on_failure": False,
    "email_on_retry": False,
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
}
dag = DAG("tutorial", default_args=default_args, schedule_interval=timedelta(1), catchup=False)
To be clear: if you enabled this DAG as specified when the current time is 2018-10-22T09:00:00.000 EDT (that is, 2018-10-22T13:00:00.000Z), it would be started some time after 2018-10-22T13:00:00.000Z with a run date marked 2018-10-21T00:00:00.000Z.
This is not backfilling from the start date; without any prior run, it does "catch up" the most recent completed valid period. I'm not sure why that's been the case in Airflow for a while, but it's documented that catchup=False means creating a single run for the very most recent valid period.
If the dagrun run date is further confusing to you, please recall that run dates are the execution_date which is the start of the interval period. The data for the interval is only completely available at the end of the interval period, but Airflow is designed to pass in the start of the period.
Then the next run would start sometime after 2018-10-23T00:00:00.000Z with an execution_date set as 2018-10-22T00:00:00.000Z.
If, on the 22nd or later, you're getting any run date earlier than the 21st, or multiple runs scheduled, then yes, catchup=False is not working. But there are no other reports of that being the case in v1.10 or the v1-10-stable branch.
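As a concrete sketch of the timing described above, using the dates from this thread (this illustrates the intended catchup=False behavior, not the timedelta bug mentioned below):
from datetime import datetime, timedelta
from airflow import DAG

dag = DAG(
    "tutorial",
    default_args={"owner": "airflow", "start_date": datetime(2018, 9, 16)},
    schedule_interval=timedelta(days=1),
    catchup=False,
)
# Unpaused around 2018-10-22T13:00Z, the scheduler creates a single DagRun with
# execution_date 2018-10-21T00:00:00Z (the start of the most recent completed
# daily interval) instead of backfilling every day since 2018-09-16.
# The next run then starts after 2018-10-23T00:00Z with execution_date 2018-10-22T00:00:00Z.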
As @dlamblin mentioned, and as mentioned in the docs too, Airflow will create a single DagRun for the most recent valid interval. catchup=False instructs the scheduler to only create a DAG Run for the most current instance of the DAG interval series.
However, there was a bug when using a timedelta for schedule_interval instead of a cron expression or cron preset. This has been fixed in Airflow master with https://github.com/apache/airflow/pull/8776. We will release Airflow 1.10.11 with this fix.
I know this thread is a little old, but setting catchup_by_default = False in airflow.cfg did stop Airflow from backfilling for me.
(My Airflow version is 1.10.12)
I resent that this config is not set to False by default. This, and the fact that a DAG starts one schedule_interval after the start_date, are the two most confusing things that stump Airflow beginners.
The first time I used Airflow, I wasted an entire afternoon trying to figure out why my test task, which was scheduled to run every 5 minutes, was running in quick succession (say every 5-6 seconds). It took me a while to realize that it was backfill in action.

Possible to set different executor for each Airflow DAG?

I am looking to add another DAG to an existing Airflow server. The server is currently using LocalExecutor, but I might want my DAG to use CeleryExecutor. It seems like the configuration file airflow.cfg only allows one executor:
# The executor class that airflow should use. Choices include
# SequentialExecutor, LocalExecutor, CeleryExecutor
executor = LocalExecutor
Is it possible to configure Airflow such that the existing DAGs can continue to use LocalExecutor and my new DAG can use CeleryExecutor or a custom executor class? I haven't found any examples of people doing this nor come across anything in the Airflow documentation.
If you have a SubDAG within your DAG, you can pass in a specific executor to that SubDagOperator. For instance, to use a SequentialExecutor:
bar_subdag = SubDagOperator(
    task_id='bar',
    subdag=my_subdag('foo', 'bar', default_args),
    default_args=default_args,
    dag=foo_dag,
    executor=SequentialExecutor()
)
This is on 1.8, not sure if 1.9 is different.
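For reference, my_subdag above is a user-defined factory that returns the child DAG. A minimal sketch of such a factory, with the operator and schedule chosen only for illustration, might look like:
from airflow.models import DAG
from airflow.operators.dummy_operator import DummyOperator

def my_subdag(parent_dag_id, child_task_id, args):
    # The child DAG's id must follow the "<parent_dag_id>.<subdag_task_id>" convention
    subdag = DAG(
        dag_id='%s.%s' % (parent_dag_id, child_task_id),
        default_args=args,
        schedule_interval='@daily',  # typically mirrors the parent DAG's schedule
    )
    DummyOperator(task_id='placeholder_task', dag=subdag)
    return subdag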
Seems the scheduler will only start one instance of the executor.
