In Airflow Data-aware scheduling, how could I do "when one of the datasets in the schedule list is updated, then the DAG will be scheduled"?

According to Airflow document:
https://airflow.apache.org/docs/apache-airflow/2.4.0/concepts/datasets.html#multiple-datasets
Multiple Datasets
As the schedule parameter is a list, DAGs can require multiple datasets, and the DAG will be scheduled once all datasets it consumes have been updated at least once since the last time it was run:
with DAG(
    dag_id='multiple_datasets_example',
    schedule=[
        example_dataset_1,
        example_dataset_2,
        example_dataset_3,
    ],
    ...,
):
Only once all the datasets in the schedule have been updated is this consumer DAG scheduled; that is the behavior of this DAG.
What if I want "when one of the datasets in the schedule list is updated, then the DAG will be scheduled"? Since I cannot use this approach, would I have to use the more traditional TriggerDagRunOperator?

Yes, you are correct that as of now (Airflow 2.5.0), if you provide more than one dataset to the schedule parameter, the DAG will wait for all of them to be updated before running. Having more configuration around this behavior is something that is being discussed; I found a reference to it in AIP-48 in the future work section (second to last paragraph). But yes, for now you'd have to use the TriggerDagRunOperator.
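For illustration, a minimal sketch of the TriggerDagRunOperator workaround, using Airflow 2-style imports; the producer DAG id and its tasks are made up here. Each producer DAG triggers the consumer directly, so the consumer runs whenever any single producer finishes, and the consumer itself would be defined with schedule=None instead of a dataset list.

from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator
from airflow.operators.trigger_dagrun import TriggerDagRunOperator

# One of several producer DAGs; each one ends by triggering the consumer,
# so an update to any single dataset leads to a consumer run.
with DAG(
    dag_id="producer_for_dataset_1",
    start_date=datetime(2023, 1, 1),
    schedule="@daily",
    catchup=False,
):
    update_dataset_1 = EmptyOperator(task_id="update_dataset_1")  # stand-in for the real update task

    trigger_consumer = TriggerDagRunOperator(
        task_id="trigger_consumer",
        trigger_dag_id="multiple_datasets_example",  # the consumer, defined with schedule=None
    )

    update_dataset_1 >> trigger_consumer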

Related

run 2 scripts in same DAG with different schedule

Let's say you have 2 scripts: Daily_summary.py and Weekly_summary.py.
You could create 2 separate DAGs with daily and weekly schedules, but is it possible to solve this with 1 DAG?
I've tried a daily schedule, and simply putting this at the bottom (simplified):
if datetime.today().strftime('%A') == 'Sunday':
    SSHOperator(run weekly_summary.py)
But the problem is that if it is still running on Sunday at midnight, Airflow will terminate this task since the operator no longer exists on Monday.
If I could somehow get the execution day's day-of-the-week, that would solve it, but with Jinja templating '{{ds}}' is not actually a 'yyyy-mm-dd' text at DAG-definition time, so I cannot convert it to a date with the datetime package. It only becomes a date string AFTER the Airflow script gets executed.
You should dynamically generate two DAGs, but you can reuse the same code for that. This is the power of Airflow - it is Python code, so you can easily use the same code to generate the same DAG "structure" but with two different dag ids and two different schedules.
See this nice article from Astronomer with some examples: https://www.astronomer.io/guides/dynamically-generating-dags
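As an illustration of that approach, here is a minimal sketch that builds a daily and a weekly DAG from the same function; the dag ids, schedules, and the use of BashOperator (instead of the SSHOperator from the question) are assumptions.

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

def build_summary_dag(dag_id, schedule, script):
    # Same DAG "structure" every time; only the id and the schedule differ.
    with DAG(
        dag_id=dag_id,
        start_date=datetime(2023, 1, 1),
        schedule_interval=schedule,
        catchup=False,
    ) as dag:
        BashOperator(
            task_id="run_summary",
            bash_command=f"python {script}",
        )
    return dag

# Two different dag ids, two different schedules, one shared definition.
# Assigning them to module-level names is enough for the scheduler to pick both up.
daily_dag = build_summary_dag("daily_summary", "@daily", "Daily_summary.py")
weekly_dag = build_summary_dag("weekly_summary", "0 0 * * 0", "Weekly_summary.py")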

Airflow DAGs recreation in each operator

We are using Airflow 2.1.4 and running on Kubernetes.
We have separated pods for web-server, scheduler and we are using Kubernetes executors.
We are using variety of operator such as PythonOperator, KubernetesPodOperator etc.
Our setup handles ~2K customers (businesses) and each one of them has its own DAG.
Our code looks something like:
def get_customers():
    logger.info("querying database to get all customers")
    return sql_connection.query("SELECT id, name, offset FROM customers_table")

customers = get_customers()

for id, name, offset in customers:
    dag = DAG(
        dag_id=f"{id}-{name}",
        schedule_interval=offset,
    )
    with dag:
        first = PythonOperator(..)
        second = KubernetesPodOperator(..)
        third = SimpleHttpOperator(..)

        first >> second >> third

    globals()[id] = dag
In the snippet above is a simplified version of what we've got, but we have a few dozens of operators in the DAG (and not just three).
The problem is that for each one of the operators in each one of the DAGs we see the "querying database to get all customers" log - which means that we query the database way more often than we want to.
The database isn't updated frequently, and updating the DAGs only once or twice a day would be enough for us.
I know that the DAGs are being saved in the metadata database or something like that.
Is there a way to build those DAGs only one time / via scheduler and not to do that per operator?
Should we change the design to support our multi-tenancy requirement? Is there a better option than that?
In our case, ~60 operators X ~2,000 customers = ~120,000 queries to the database.
Yes, this is entirely expected. The DAGs are parsed by Airflow regularly (every 30 seconds by default), so any top-level code (the code that is executed while parsing the file, rather than the "execute" methods of operators) is executed then.
The simple answer (and best practice) is "do not use any heavy operations in the top-level code of your DAGs" - specifically, do not use DB queries. But if you want more specific answers and possible ways you can handle it, there are dedicated chapters about it in the Airflow best-practices documentation:
This is the explanation of why top-level code should be "light": https://airflow.apache.org/docs/apache-airflow/stable/best-practices.html#top-level-python-code
This one is about strategies you can use to avoid "heavy" operations in top-level code when you do dynamic DAG generation, as in your case: https://airflow.apache.org/docs/apache-airflow/stable/best-practices.html#dynamic-dag-generation
In short there are three proposed ways:
using env variables
generating a configuration file (for example .json) from your DB automatically (periodically) with an external script, putting it next to your DAG, and having your DAG read that JSON file rather than running a SQL query (sketched below)
generating many DAG python files dynamically (for example using Jinja), also automatically and periodically, using an external script.
You could use either 2) or 3) to achieve your goal, I believe.
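A minimal sketch of option 2), assuming a hypothetical customers.json that an external script refreshes next to the DAG file once or twice a day; the file name and its fields are made up.

import json
from datetime import datetime
from pathlib import Path

from airflow import DAG
from airflow.operators.python import PythonOperator

# Written by an external script, e.g.:
# [{"id": "123", "name": "acme", "offset": "0 6 * * *"}, ...]
CUSTOMERS_FILE = Path(__file__).parent / "customers.json"

# Cheap at parse time: a small local file read instead of a database query.
customers = json.loads(CUSTOMERS_FILE.read_text())

for customer in customers:
    dag = DAG(
        dag_id=f"{customer['id']}-{customer['name']}",
        start_date=datetime(2023, 1, 1),
        schedule_interval=customer["offset"],
        catchup=False,
    )
    with dag:
        PythonOperator(
            task_id="first",
            python_callable=lambda: print("replace with the real work"),
        )
    globals()[dag.dag_id] = dag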

Airflow 1.10.10 Schedule change for DAG

I am using Airflow 1.10.10 and wanted to know how to change an Airflow DAG's schedule. I checked online, and in most of the comments it's suggested that to change the schedule of a DAG you should create a new DAG with a new dag_id, or change the dag_id of the existing DAG and give it the new schedule_interval. Supposedly, attempting to change the schedule of an existing DAG will not work in a straightforward manner and will throw errors or create scheduling problems.
However, I tried to test this so that I could reproduce the scenario where a DAG schedule change leads to erroneous cases. I did this by only changing schedule_interval in the DAG file. I tried the schedule changes below in my DAG and everything worked as expected: the schedule was changed properly and no erroneous case was found.
Started with @daily
Changed to 10 min
Changed to 17 min
Changed to 15 min
Changed to 5 min
Can someone please clarify what kind of problems may arise if we change the schedule_interval of a DAG without changing its ID?
I do see this recommendation on the old Airflow Confluence page on Common Pitfalls.
When needing to change your start_date and schedule interval, change the name of the dag (a.k.a. dag_id) - I follow the convention : my_dag_v1, my_dag_v2, my_dag_v3, my_dag_v4, etc...
Changing schedule interval always requires changing the dag_id, because previously run TaskInstances will not align with the new schedule interval
Changing start_date without changing schedule_interval is safe, but changing to an earlier start_date will not create any new DagRuns for the time between the new start_date and the old one, so tasks will not automatically backfill to the new dates. If you manually create DagRuns, tasks will be scheduled, as long as the DagRun date is after both the task start_date and the dag start_date.
I don't know the author's intent, but I imagine changing the schedule_interval can cause confusion for users. When they revisit these tasks, they will wonder why the current schedule_interval does not match past task executions, because that information is not stored at the task level.
Changing the schedule_interval does not impact past dagruns or tasks. The change will affect when new dagruns are created, which impacts the tasks within those dagruns.
I personally do not modify the dag_id when I update a DAG's schedule_interval, for two reasons:
If I keep the previous DAG, I am unnecessarily inducing more stress on the scheduler for processing a DAG that will not be turned on.
If I do not keep the previous DAG, I essentially lose all the history of the dagrun where it had a different schedule_interval.
Edit: Looks like there is a GitHub issue created to move the Common Pitfalls page, but it is stale.
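For what it's worth, a minimal sketch of the dag_id versioning convention quoted above, using Airflow 1.10-style imports; the dag ids, dates, and schedules are made up.

from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

# v1 ran daily; it is kept (switched off) so its run history stays readable.
with DAG(
    dag_id="my_dag_v1",
    start_date=datetime(2020, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as my_dag_v1:
    BashOperator(task_id="run_summary", bash_command="echo 'v1'")

# v2 is the same pipeline registered under a new dag_id with the new schedule.
with DAG(
    dag_id="my_dag_v2",
    start_date=datetime(2020, 6, 1),
    schedule_interval="*/15 * * * *",
    catchup=False,
) as my_dag_v2:
    BashOperator(task_id="run_summary", bash_command="echo 'v2'")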

Airflow DAG B runs many times after DAG A is completed for that day

DAG A has schedule '0 6 * * *'.
DAG B has schedule '*/5 * * * *'.
However, DAG B should only start running for that day once DAG A has completed for that day.
I've played around with SubDags and ExternalTaskSensor but haven't yet found a satisfactory solution and I'm sure I'm missing something good. Recommendations?
Edit: say DAG A is my ETL. DAG B has some tasks that query my database and require that data to be up-to-date. DAG B gets run throughout the day, but only once the ETL is completed.
I can see using ShortCircuitOperator, for example, and having the condition be "DAG A that ran today is completed." But how could I write this condition?
This question is not an exact duplicate but is similar to another which already has 3 good answers: Scheduling dag runs in Airflow.
I recommend reading through all of them, but to summarize the info in the answers over there, there are several viable options for the use case of a DAG dependent upon another DAG:
TriggerDagRunOperator
BranchPythonOperator
ShortCircuitOperator
SubDagOperator / SubDAGs
With any of these options you may want to experiment with the trigger rule
External triggers (possibly less relevant for your use case)
If you can add more detail around the use case you're trying to accomplish, I could give more specific guidance as well.
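As a concrete starting point, here is a hedged sketch of the ShortCircuitOperator idea from the question, using Airflow 2-style imports; the dag ids are hypothetical, and the "has DAG A succeeded today" check queries the DagRun model directly.

from datetime import datetime

from airflow import DAG
from airflow.models import DagRun
from airflow.operators.empty import EmptyOperator
from airflow.operators.python import ShortCircuitOperator
from airflow.utils.state import State

def dag_a_succeeded_today(ds, **_):
    # ds is this run's date as "YYYY-MM-DD"; only proceed if DAG A ("dag_a",
    # a hypothetical id) already has a successful run on the same date.
    runs = DagRun.find(dag_id="dag_a", state=State.SUCCESS)
    return any(run.execution_date.strftime("%Y-%m-%d") == ds for run in runs)

with DAG(
    dag_id="dag_b",
    start_date=datetime(2023, 1, 1),
    schedule_interval="*/5 * * * *",
    catchup=False,
):
    check_dag_a = ShortCircuitOperator(
        task_id="check_dag_a",
        python_callable=dag_a_succeeded_today,
    )
    query_database = EmptyOperator(task_id="query_database")  # stand-in for the real queries

    check_dag_a >> query_database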
Use the TriggerDagRunOperator to call one DAG after another. Refer to this question. I'm afraid I cannot provide a satisfactory example as I have not used it yet.

Is Airflow a good fit for DAG that doesn’t care about execution date/time?

The API in Airflow seems to suggest it is built around backfilling, catching up, and scheduling runs at regular intervals.
I have an ETL that extracts data on S3, with the versions of the previous node (where the data comes from) in the DAG. For example, here are the nodes of the DAG:
ImageNet-mono
ImageNet-removed-red
ImageNet-mono-scaled-to-100x100
ImageNet-removed-red-scaled-to-100x100
where ImageNet-mono is the previous node of ImageNet-mono-scaled-to-100x100 and
where ImageNet-removed-red is the previous node of ImageNet-removed-red-scaled-to-100x100
Both of them go through transformation of scaled-to-100x100 pipeline but producing different data since the input is different.
As you can see, there is no date involved. Is Airflow a good fit?
EDIT
Currently, the graph is simple enough to be managed manually, with fewer than 10 nodes. The nodes won't run at a regular interval. Instead, as soon as someone updates the code for a node, I have to run the downstream nodes manually one by one: python GetImageNet.py removed-red, then python scale.py 100 100 ImageNet-removed-red, and then python scale.py 100 100 ImageNet-mono. I am looking for a way to manage the graph with a single click to trigger the run.
I think it's fine to use Airflow as long as you find it useful to use the DAG representation. If your DAG does not need to be executed on a regular schedule, you can set the schedule to None instead of a crontab. You can then trigger your DAG via the API or manually via the web interface.
If you want to run specific tasks you can trigger your DAG and mark tasks as success or clear them using the web interface.
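To make that concrete, a minimal sketch of an unscheduled DAG wired up for the ImageNet example; the dag id and start date are made up, and the bash commands mirror the manual steps from the question.

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="imagenet_pipeline",
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,  # never scheduled automatically, only triggered
    catchup=False,
) as dag:
    mono = BashOperator(
        task_id="imagenet_mono",
        bash_command="python GetImageNet.py mono",
    )
    removed_red = BashOperator(
        task_id="imagenet_removed_red",
        bash_command="python GetImageNet.py removed-red",
    )
    mono_scaled = BashOperator(
        task_id="imagenet_mono_scaled_to_100x100",
        bash_command="python scale.py 100 100 ImageNet-mono",
    )
    removed_red_scaled = BashOperator(
        task_id="imagenet_removed_red_scaled_to_100x100",
        bash_command="python scale.py 100 100 ImageNet-removed-red",
    )

    mono >> mono_scaled
    removed_red >> removed_red_scaled

One click on "Trigger DAG" in the UI (or airflow dags trigger imagenet_pipeline from the CLI) then runs the whole graph.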
