How does Airflow determine when to re-import a DAG file?

I'm seeing some interesting behavior in how Airflow treats an existing DAG file. For example, I have a DAG file like this:
# my_pipeline.py
from dag_gen import DagGenerator
g = globals()
g.update(DagGenerator.generate())
In the code above, DagGenerator.generate() returns a dictionary with two items:
{
    'my_dag': DAG(...),
    'my_op': DummyOperator(...),
}
Technically, this code should be equivalent to:
# my_pipeline_equivalent.py
my_dag = DAG(...)
my_op = DummyOperator(...)
For some reason, the file my_pipeline.py is not being picked up by Airflow while my_pipeline_equivalent.py can be picked up without any problem.
However, if I add the following code to my_pipeline.py, both DAGs (i.e. my_dag and my_dag2) are picked up.
# my_pipeline.py
from dag_gen import DagGenerator
g = globals()
g.update(DagGenerator.generate())
# newly added
from airflow import DAG
from datetime import datetime
from airflow.operators.dummy_operator import DummyOperator
x = DAG(
    'my_dag2',
    schedule_interval="@daily",
    start_date=datetime(2021, 1, 1),
    catchup=False,
    default_args={
        "retries": 2,  # If a task fails, it will retry 2 times.
    },
    tags=['example'],
)
b = DummyOperator(task_id='d3', dag=x)
What makes this stranger is that if I now comment out the newly added part, as below, my_dag is still picked up while my_dag2 is gone.
# my_pipeline.py
from dag_gen import DagGenerator
g = globals()
g.update(DagGenerator.generate())
# newly added
# from airflow import DAG
# from datetime import datetime
# from airflow.operators.dummy_operator import DummyOperator
# x = DAG(
#     'my_dag2',
#     schedule_interval="@daily",
#     start_date=datetime(2021, 1, 1),
#     catchup=False,
#     default_args={
#         "retries": 2,  # If a task fails, it will retry 2 times.
#     },
#     tags=['example'],
# )
# b = DummyOperator(task_id='d3', dag=x)
However, if I actually delete the commented code, my_dag is gone as well. In order to get my_dag back, I have to add the my_dag2 code back without the comments (the commented-out my_dag2 code, which worked previously, doesn't work now).
Could anyone help me understand what's going on here? If I remember correctly, Airflow has some logic to determine when/if to import/re-import a Python file. Does anyone know where that logic lives in the code?
Thanks.
Some additional findings: once I have both DAGs (my_dag and my_dag2) being picked up, if I change my_pipeline.py to keep only the DAG import, my_dag is still picked up, even with the import commented out, but I cannot remove that line. If I do remove it, my_dag is gone again.
# my_pipeline.py
from dag_gen import DagGenerator
g = globals()
g.update(DagGenerator.generate())
# from airflow import DAG
My guess is that Airflow must be reading the Python file and looking for that import as a string.

If you follow the steps that Airflow takes to process the files, you will get to this line of code:
return all(s in content for s in (b'dag', b'airflow'))
This means that the Airflow file processor will ignore files that don't contain both of the words dag and airflow.
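For reference, here is a minimal, hedged sketch of that heuristic in isolation, so you can check which of your files would pass it (the real implementation lives in Airflow's file-processing utilities and its exact behavior varies by version):
# Hedged sketch of Airflow's "safe mode" pre-filter; casing and zip handling differ across versions.
def might_contain_dag(file_path: str) -> bool:
    with open(file_path, 'rb') as f:
        content = f.read().lower()  # recent versions lowercase the content before checking
    # Only files containing both substrings are handed to the actual DAG parser.
    return all(s in content for s in (b'dag', b'airflow'))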
Since your module already contains the word dag, you can get it processed by simply adding a comment at the beginning that contains the word airflow:
# airflow
from dag_gen import DagGenerator
g = globals()
g.update(DagGenerator.generate())
Or just rename the class DagGenerator to AirflowDagGenerator, which solves the problem in all the modules that import it.
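For example, after such a rename the import line alone already satisfies the pre-filter, assuming a version where the check is case-insensitive (recent versions lowercase the file content first); the module and class names below just follow the answer's suggestion:
# my_pipeline.py
# 'dag_gen' supplies the word dag and 'AirflowDagGenerator' supplies the word airflow.
from dag_gen import AirflowDagGenerator
g = globals()
g.update(AirflowDagGenerator.generate())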

Related

modify dag runs when triggered

I was curious if there's a way to customise the dag runs.
So I'm currently checking for updates to another table, which gets updated manually by someone, and once that's been updated I would run my DAG for the month.
At the moment I have created a branch operator that compares the dates of the two tables, but is there a way to run the DAG (compare the two dates) every day until there is a change, and then not run it for the remainder of the month?
For example,
Table A (that is updated manually) has YYYYMM as 202209 and Table B also has YYYYMM as 202209.
At the moment, my branch operator compares the two YYYYMM values and points to a dummy end operator when they are the same. However, when Table A is updated to 202210, there's a difference between the two YYYYMM values, so another task runs and overwrites Table B.
It all works, but this runs the DAG every day even though Table A only gets updated once a month, at a random point in time within the month. So is there a way to stop the DAG for the remaining days of the month after the task has been triggered?
Hope this is clear.
If you were using data stored on S3, there would be an easy solution starting from version 2.4: data-aware scheduling.
But you're probably not, so here is another option.
A DAG in Airflow is a DAG object assigned to the global scope of a file. This allows for dynamic creation of DAGs, and it implies each file is re-parsed on a certain interval. A very good description with examples is here.
The second thing you need to use is Airflow Variables.
So the concept is as follows:
1. Create a variable in Airflow named dag_run that will hold the month in which the DAG has successfully run
2. Create a Python file that has a function that creates a DAG object based on input parameters.
3. In the same file, use conditional statements that set the 'schedule' param differently depending on whether the DAG has already run for the current month
4. In your DAG, in the branch that executes when the data has changed, set the variable dag_run to the current month's value like so: Variable.set(key='dag_run', value=datetime.now().month) (a sketch of this branch task follows the code below)
Python code for steps 2 and 3 (step 1, the dag_run variable, can be created via the Airflow UI or CLI):
from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime, timedelta
from airflow.models import Variable

# function that creates a dag based on input
def create_dag(dag_id,
               schedule,
               default_args):

    def hello_world_py(*args):
        print('Hello World')
        print('This is DAG: {}'.format(str(dag_id)))

    dag = DAG(dag_id,
              schedule_interval=schedule,
              default_args=default_args)

    with dag:
        t1 = PythonOperator(
            task_id='hello_world',
            python_callable=hello_world_py)

    return dag

# run some checks
current_month = datetime.now().month
dag_run_month = int(Variable.get('dag_run'))

if current_month == dag_run_month:
    # keep the schedule off
    schedule = None
    dag_id = "Database_insync"
elif current_month != dag_run_month:
    # keep the schedule on
    schedule = "30 * * * *"
    dag_id = "Database_notsynced"

# watch out for start_date: if you leave it in the past,
# airflow will execute past missing schedules
default_args = {'owner': 'airflow',
                'start_date': datetime.now() - timedelta(minutes=15)}

globals()[dag_id] = create_dag(dag_id,
                               schedule,
                               default_args)
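For step 4, here is a minimal, hedged sketch of the branch-side callable (the function name and wiring are illustrative, not from the question) that records the month once the sync has happened, so the file above turns the schedule off at the next parse:
# Hedged sketch of step 4: call this in the branch that runs when Table A has changed.
from datetime import datetime
from airflow.models import Variable

def mark_month_as_synced(**_):
    # Store the month of the successful sync; the generator file above reads it
    # back via Variable.get('dag_run') to decide whether to keep the schedule on.
    Variable.set(key='dag_run', value=datetime.now().month)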

Airflow: Importing decorated Task vs all tasks in a single DAG file?

I recently started using Apache Airflow and one of its new concepts, the TaskFlow API. I have a DAG with multiple decorated tasks, where each task has 50+ lines of code. So I decided to move each task into a separate file.
After referring to Stack Overflow I could somehow move the tasks of the DAG into a separate file per task. Now, my questions are:
Do both the code samples shown below work the same? (I am worried about the scope of the tasks.)
How will they share data between them?
Is there any difference in performance? (I read that SubDAGs are discouraged due to performance issues; this is not a SubDAG, I'm just concerned.)
All the code samples I see on the web (and in the official documentation) put all the tasks in a single file.
Sample 1
import logging
from airflow.decorators import dag, task
from datetime import datetime

default_args = {"owner": "airflow", "start_date": datetime(2021, 1, 1)}

@dag(default_args=default_args, schedule_interval=None)
def No_Import_Tasks():
    # Task 1
    @task()
    def Task_A():
        logging.info(f"Task A: Received param None")
        # Some 100 lines of code
        return "A"

    # Task 2
    @task()
    def Task_B(a):
        logging.info(f"Task B: Received param {a}")
        # Some 100 lines of code
        return str(a + "B")

    a = Task_A()
    ab = Task_B(a)

No_Import_Tasks = No_Import_Tasks()
Sample 2 Folder structure:
- dags
  - tasks
    - Task_A.py
    - Task_B.py
  - Main_DAG.py
File Task_A.py
import logging
from airflow.decorators import task
@task()
def Task_A():
    logging.info(f"Task A: Received param None")
    # Some 100 lines of code
    return "A"
File Task_B.py
import logging
from airflow.decorators import task
@task()
def Task_B(a):
    logging.info(f"Task B: Received param {a}")
    # Some 100 lines of code
    return str(a + "B")
File Main_DAG.py
from airflow.decorators import dag
from datetime import datetime
from tasks.Task_A import Task_A
from tasks.Task_B import Task_B
default_args = {"owner": "airflow", "start_date": datetime(2021, 1, 1)}
@dag(default_args=default_args, schedule_interval=None)
def Import_Tasks():
    a = Task_A()
    ab = Task_B(a)

Import_Tasks_dag = Import_Tasks()
Thanks in advance!
There is virtually no difference between the two approaches, neither from a logic nor a performance point of view.
Tasks in Airflow share data between them using XCom (https://airflow.apache.org/docs/apache-airflow/stable/concepts/xcoms.html), effectively exchanging data via the database (or other external storage). Two tasks in Airflow, no matter whether they are defined in one or many files, can be executed on completely different machines anyway (there is no task affinity in Airflow; each task execution is totally separate from other task executions). So it does not matter, again, whether they are in one or many Python files.
Performance should be similar. Maybe splitting into several files is a tiny bit slower, but it should be totally negligible and possibly not even measurable at all; it depends on the deployment you have, the way you distribute files, etc., but I cannot imagine this having any observable impact.
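For what it's worth, a decorated task's return value is just an XCom under the hood. Below is a hedged sketch of roughly what Sample 1's ab = Task_B(a) amounts to with the classic API (the callable name and the commented-out operator are illustrative):
# Hedged sketch: classic-API equivalent of passing Task_A's return value to Task_B.
from airflow.operators.python import PythonOperator

def task_b_callable(ti, **_):
    # Pull the value Task_A returned; TaskFlow stores it as Task_A's
    # 'return_value' XCom in the metadata database.
    a = ti.xcom_pull(task_ids='Task_A')
    return str(a + "B")

# t_b = PythonOperator(task_id='Task_B', python_callable=task_b_callable, dag=dag)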

Airflow: Rerunning DAG can't load XCOMs from previous run

Is there a way to persist an XCOM value during re-runs of a DAG step (after clearing the status)?
Below is a simplified version of what I'm trying to accomplish: when a DAG step's status is cleared and the step is re-run, I would like to be able to load the XCom value pushed on the previous run. However, even though I can see the value in the XCom interface, the value does not get pulled. I've looked through the source code for the xcom_pull() method but can't figure out where it is being filtered out.
The functionality I'm trying to achieve is to maintain some amount of state between failed runs of a DAG. In the example, this would mean that 1 is added to the stored value every time the DAG step is cleared and rerun.
from datetime import datetime
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
def test_step(**kwargs):
    ti = kwargs.get('task_instance')
    value = ti.xcom_pull(key='key', include_prior_dates=True)
    if value is None:
        value = 0
    print(f'BEFORE VALUE: {value}')
    value += 1
    print(f'AFTER VALUE: {value}')
    ti.xcom_push(key='key', value=value)
    # Simulating a failure
    raise Exception

default_args = {
    'owner': 'Testing',
    'depends_on_past': False,
    'email': ['test@test.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 0,
}

dag = DAG(
    'test_dag',
    default_args=default_args,
    schedule_interval=None,
    start_date=datetime(2020, 4, 9),
)

t1 = PythonOperator(
    task_id='test_step',
    provide_context=True,
    python_callable=test_step,
    dag=dag,
)

t1
Anytime a task is about to run, its XCom is cleared for the current execution date (https://github.com/apache/airflow/blob/1.10.10/airflow/models/taskinstance.py#L960). This is why you won't ever pull values from previous task tries. Use of include_prior_dates=True only pulls from previous execution dates, but not previous runs of the same execution date.
One possible solution is to put a DummyOperator task upstream of your test_step task, called say xcom_store.test_step. Then use airflow.models.XCom.set() directly in test_step to write your XCom values against the xcom_store.test_step task (see xcom_push() for reference). When you need to pull, just pull as you usually would, but from the dummy task instead, i.e. ti.xcom_pull(task_ids='xcom_store.test_step', key='key'). Definitely not ideal and could lead to some confusion, but if you standardize it and build some helpers around it, it could be alright.
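A rough, hedged sketch of that workaround, assuming the 1.10-era API the question uses (double-check XCom.set's exact signature for your version; the dummy task's id follows the answer's naming):
# Hedged sketch of the workaround described above; not battle-tested.
from airflow.models import XCom
from airflow.operators.dummy_operator import DummyOperator

xcom_store = DummyOperator(task_id='xcom_store.test_step', dag=dag)

def test_step(**kwargs):
    ti = kwargs.get('task_instance')
    # Pull from the dummy task; its XComs are only cleared when the dummy task
    # itself runs, not when test_step is cleared and re-run.
    value = ti.xcom_pull(task_ids='xcom_store.test_step', key='key') or 0
    value += 1
    # Write the value against the dummy task so the next try can read it back.
    XCom.set(
        key='key',
        value=value,
        task_id='xcom_store.test_step',
        dag_id=ti.dag_id,
        execution_date=ti.execution_date,
    )
    raise Exception  # simulating a failure, as in the question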

How to retry complete Airflow DAG?

I know that it is possible to retry individual tasks, but is it possible to retry complete DAG?
I create tasks dynamically, which is why I need to retry not a specific task but the complete DAG. If that is not supported by Airflow, maybe there is some workaround.
I wrote the script below and scheduled it on the Airflow master to rerun the failed DAG runs for the DAGs listed in the dag_ids_to_monitor array:
import subprocess
import re
from datetime import datetime

dag_ids_to_monitor = ['dag1', 'dag2', 'dag2']

def runBash(cmd):
    print("running bash command {}".format(cmd))
    # decode so the CLI output can be split into text lines
    output = subprocess.check_output(cmd.split()).decode()
    return output

def datetime_valid(dt_str):
    try:
        datetime.strptime(dt_str, '%Y-%m-%dT%H:%M:%S')
        print(dt_str)
        print(datetime.strptime(dt_str, '%Y-%m-%dT%H:%M:%S'))
    except ValueError:
        return False
    return True

def get_schedules_to_rerun(dag_id):
    bashCommand = f"airflow list_dag_runs --state failed {dag_id}"
    output = runBash(bashCommand)
    schedules_to_rerun = []
    for line in output.split('\n'):
        parts = re.split(r"\s*\|\s*", line)
        if len(parts) > 4 and datetime_valid(parts[3][:-6]):
            schedules_to_rerun.append(parts[3])
    return schedules_to_rerun

def trigger_runs(dag_id, re_run_start_times):
    for start_time in re_run_start_times:
        runBash(f"airflow clear --no_confirm --start_date {start_time} --end_date {start_time} {dag_id}")

def rerun_failed_dag_runs(dag_id):
    re_run_start_times = get_schedules_to_rerun(dag_id)
    trigger_runs(dag_id, re_run_start_times)

for dag_id in dag_ids_to_monitor:
    rerun_failed_dag_runs(dag_id)
If you have access to the Airflow UI, go to Graph view.
In graph view, individual tasks are marked as boxes and the DAG run as a whole is indicated by circles. Click on a circle and then the clear option. This will restart the entire run.
Alternatively you can go to the tree view and clear the first task in the DAG.
Go to the Airflow UI, click on the first task(s) of your DAG, choose "Downstream" and "Recursive" to the right of the "Clear" button, and then press "Clear". This will mark the DAG as not yet run and rerun it if the DAG schedule permits it.

Airflow: re execute the jobs of a DAG for the past n days on a daily basis

I have scheduled the execution of a DAG to run daily.
It works perfectly for one day.
However each day I would like to re-execute not only for the current day {{ ds }} but also for the previous n days (let's say n = 7).
For example, in the next execution scheduled on "2018-01-30", I would like Airflow not only to run the DAG with execution date "2018-01-30", but also to re-run the DAG for all the previous days from "2018-01-23" to "2018-01-30".
Is there an easy way to "invalidate" the previous execution so that a backfill is run automatically?
You can dynamically generate tasks in a loop and pass the offset to your operator.
Here is an example with the PythonOperator.
import airflow
from airflow.operators.python_operator import PythonOperator
from airflow.models import DAG
from datetime import timedelta

args = {
    'owner': 'airflow',
    'start_date': airflow.utils.dates.days_ago(2),
    'schedule_interval': '0 10 * * *'
}

# a DAG object is needed for the tasks below; the dag_id here is illustrative
dag = DAG(dag_id='rerun_previous_days', default_args=args)

def check_trigger(execution_date, day_offset, **kwargs):
    target_date = execution_date - timedelta(days=day_offset)
    # use target_date

for day_offset in range(1, 8):
    PythonOperator(
        task_id='task_offset_' + str(day_offset),
        python_callable=check_trigger,
        provide_context=True,
        dag=dag,
        op_kwargs={'day_offset': day_offset}
    )
Have you considered having the dag that runs once a day just run your task for the last 7 days? I imagine you’ll just have 7 tasks that each spawn a SubDAG with a different day offset from your execution date.
I think that will make debugging easier and history cleaner. I believe trying to backfill already executed tasks will involve deleting task instances or setting their states all to NONE. Then you’ll still have to trigger a backfill on those dag runs. It’ll be harder to track when things fail and just seems a bit messier.
