Is Airflow a good fit for a DAG that doesn't care about execution date/time?

The API in Airflow seems to suggest it is built around backfilling, catching up and scheduling runs at regular intervals.
I have an ETL that extracts data on S3 using the versions of the previous node (where the data comes from) in the DAG. For example, here are the nodes of the DAG:
ImageNet-mono
ImageNet-removed-red
ImageNet-mono-scaled-to-100x100
ImageNet-removed-red-scaled-to-100x100
where ImageNet-mono is the previous node of ImageNet-mono-scaled-to-100x100 and
where ImageNet-removed-red is the previous node of ImageNet-removed-red-scaled-to-100x100
Both of them go through the same scaled-to-100x100 transformation pipeline but produce different data, since their inputs are different.
As you can see, no date is involved. Is Airflow a good fit?
EDIT
Currently, the graph is simple enough to be managed manually, with fewer than 10 nodes. The nodes won't run at a regular interval; instead, as soon as someone updates the code for a node, I would have to run the downstream nodes manually, one by one: python GetImageNet.py removed-red, then python scale.py 100 100 ImageNet-removed-red, and then python scale.py 100 100 ImageNet-mono. I am looking for a way to manage the graph so that a single click triggers the run.

I think it's fine to use Airflow as long as you find the DAG representation useful. If your DAG does not need to run on a regular schedule, you can set the schedule to None instead of a crontab expression. You can then trigger your DAG via the API or manually through the web interface.
If you want to run only specific tasks, you can trigger your DAG and then mark tasks as success or clear them in the web interface.
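For example, here is a minimal sketch of such an unscheduled DAG for the pipeline in the question, assuming Airflow 2 import paths (the task ids are made up; the bash commands follow the pattern of the commands in the edit above):

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="imagenet_pipeline",          # illustrative name
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,              # no schedule: runs only when triggered
    catchup=False,
) as dag:
    mono = BashOperator(task_id="imagenet_mono",
                        bash_command="python GetImageNet.py mono")
    removed_red = BashOperator(task_id="imagenet_removed_red",
                               bash_command="python GetImageNet.py removed-red")
    mono_scaled = BashOperator(task_id="imagenet_mono_scaled",
                               bash_command="python scale.py 100 100 ImageNet-mono")
    removed_red_scaled = BashOperator(task_id="imagenet_removed_red_scaled",
                                      bash_command="python scale.py 100 100 ImageNet-removed-red")

    # Dependencies mirror the node relationships described above.
    mono >> mono_scaled
    removed_red >> removed_red_scaled

A single manual trigger (airflow dags trigger imagenet_pipeline, or the "play" button in the UI) then runs the whole graph in dependency order.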

Related

Airflow - create dynamic tasks based on config

I would like to manually trigger an Airflow DAG and pass parameters in this call. How can I create DAG tasks based on the passed parameters?
Unfortunately, you can't dynamically create tasks in the sense you are asking about. That is, you cannot add or delete tasks for a single run based on the parameters you pass in. You always define the entire DAG, and that definition applies to all subsequent DagRuns as well.
You have two options that might still get you to a satisfactory solution though.
Mapped tasks, as explained here: https://airflow.apache.org/docs/apache-airflow/2.3.0/concepts/dynamic-task-mapping.html#simple-mapping (a small sketch follows below)
Generate the DAG from some external resource. Example: we have a library on which we call my_config.get_dag(), which then creates the DAG based on a JSON file.
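As a rough sketch of the first option (dynamic task mapping, available from Airflow 2.3), where the number of mapped task instances is decided at runtime; the parameter values below are placeholders:

from datetime import datetime

from airflow.decorators import dag, task

@dag(start_date=datetime(2023, 1, 1), schedule_interval=None, catchup=False)
def mapped_example():

    @task
    def get_params():
        # This list could instead be built from dag_run.conf or an external config.
        return [{"size": 100}, {"size": 200}, {"size": 300}]

    @task
    def process(param):
        print(f"processing with {param}")

    # One task instance is created per element of the returned list, at runtime.
    process.expand(param=get_params())

mapped_example()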

Airflow DAGs recreation in each operator

We are using Airflow 2.1.4 and running on Kubernetes.
We have separated pods for web-server, scheduler and we are using Kubernetes executors.
We are using a variety of operators such as PythonOperator, KubernetesPodOperator etc.
Our setup handles ~2K customers (businesses) and each one of them has its own DAG.
Our code looks something like:
def get_customers():
    logger.info("querying database to get all customers")
    return sql_connection.query("SELECT id, name, offset FROM customers_table")

customers = get_customers()

for id, name, offset in customers:
    dag = DAG(
        dag_id=f"{id}-{name}",
        schedule_interval=offset,
    )
    with dag:
        first = PythonOperator(..)
        second = KubernetesPodOperator(..)
        third = SimpleHttpOperator(..)
        first >> second >> third
    globals()[id] = dag
The snippet above is a simplified version of what we've got; in reality we have a few dozen operators in each DAG (not just three).
The problem is that for every operator in every DAG we see the "querying database to get all customers" log line, which means we query the database far more often than we want to.
The database isn't updated frequently, and updating the DAGs once or twice a day would be enough.
I know that the DAGs are saved in the metadata database, or something like that.
Is there a way to build those DAGs only once / via the scheduler, and not once per operator?
Should we change the design to support our multi-tenancy requirement? Is there a better option than that?
In our case, ~60 operators X ~2,000 customers = ~120,000 queries to the database.
Yes, this is entirely expected. The DAGs are parsed by Airflow regularly (every 30 seconds by default), so any top-level code (the code that is executed while the file is parsed, rather than the "execute" methods of operators) runs every time.
The simple answer (and best practice) is: do not put any heavy operations in the top-level code of your DAGs, and specifically do not run DB queries there. But if you want more specific answers and possible ways to handle it, there are dedicated chapters about this in the Airflow best-practices documentation:
This explains why top-level code should be "light": https://airflow.apache.org/docs/apache-airflow/stable/best-practices.html#top-level-python-code
This one is about strategies you can use to avoid "heavy" operations in top-level code when you do dynamic DAG generation, as in your case: https://airflow.apache.org/docs/apache-airflow/stable/best-practices.html#dynamic-dag-generation
In short, there are three proposed ways:
using environment variables
generating a configuration file (for example a .json file) from your DB automatically (and periodically) with an external script, putting it next to your DAG file, and having your DAG read that JSON file instead of running a SQL query (sketched below)
generating many DAG Python files dynamically (for example using Jinja), also automatically and periodically, with an external script
You could use either 2) or 3) to achieve your goal, I believe.
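A rough sketch of what option 2 could look like for the snippet above. The customers.json file name and its structure are assumptions; it would be regenerated once or twice a day by an external script that queries the database:

import json
import os
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# customers.json is assumed to be produced periodically by an external script
# that queries the customers table, so parsing this DAG file stays cheap.
CONFIG_PATH = os.path.join(os.path.dirname(__file__), "customers.json")

with open(CONFIG_PATH) as f:
    customers = json.load(f)  # e.g. [{"id": 1, "name": "acme", "offset": "@daily"}, ...]

for customer in customers:
    with DAG(
        dag_id=f"{customer['id']}-{customer['name']}",
        schedule_interval=customer["offset"],
        start_date=datetime(2021, 1, 1),
        catchup=False,
    ) as dag:
        first = PythonOperator(task_id="first", python_callable=lambda: None)
        # ... the remaining operators and their dependencies go here ...

    globals()[dag.dag_id] = dag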

Airflow - Questions on batch jobs and running a task in a DagRun multiple times

I am trying to solve the following problem with airflow:
I have a data pipeline where I want to run several processes on a number of Excel documents (e.g. 5,000 Excel files a day). My idea for a DAG is below:
Task 1 = Take an Excel file and add a new sheet to it.
Task 2 = Convert the returned Excel file to a PDF.
Tasks 1 and 2 in the DAG would call a processing tool running outside Airflow via an API call (so the actual data processing isn't happening inside Airflow).
I seem to be going around in circles with figuring out the best approach to this workflow. Some questions I keep having are:
Should each DagRun be one Excel file, or should the DagRun take in a batch of Excel files?
If taking in a batch (which I presume is the correct approach), what is the recommended batch size?
How would I pass the returned values from Task 1 to Task 2? Would it be an XCom dictionary with a reference to each newly saved Excel file? I read somewhere that the max size of an XCom should be 48 KB, so an XCom of 5,000 Excel file paths will probably be larger than that.
The last, most tricky question I have is: I would obviously want to start processing Task 2 as soon as even one Excel file from Task 1 has completed, because I wouldn't want to wait for the entire batch of Task 1 to finish before starting Task 2. How can I run Task 2 multiple times within the same DagRun, once for each new result that Task 1 produces? Or should Task 2 be its own DAG?
Am I approaching this problem the right way? How should I be tackling this problem?
Assumptions
I made some assumptions since I don't know all the details of the Excel file processing:
You cannot merge the Excel files since you need them separate.
Excel files are accessible from Airflow DAG (same filesystem or similar).
If any of that is not true, please clarify accordingly.
Answers
That being said, I'll first answer your questions and then comment on some thoughts:
I think you can process in batches, since using one run per file would be very slow (mostly because of scheduler overhead, which adds time between the processing of each Excel file). You'd also not be using all the available resources, so it's better to keep Airflow busier.
The batch size will depend on the processing load and the task design. From your question I assume you're thinking about having the batch inside the task, but if the service that processes the Excel files can handle good parallelism, I'd rather recommend one task per Excel file. Still, having 5,000 tasks (one for each file) would be a bad idea (it would be difficult to follow in the UI), so the exact number of processes per batch depends mostly on your resources and the service SLA.
From my experience I recommend using a single task for both steps, since you can call the service in parallel and, right after the service completes, directly transform the Excel file into a PDF.
This is solved by the answer to question #3.
Solution overview
The solution I imagine is something like:
A first task checks for the existence of pending files. You can fork using a BranchPythonOperator (example here).
Then you have X parallel tasks that process the Excel files (call the service) and transform them to PDF. Each could be a single PythonOperator task. If you use Airflow 2, you can simply use the @task decorator to simplify the code. X could be anywhere from 10 to 100, for example, depending on your resources and the service throughput.
A final task triggers the DAG again to process more files. This can be implemented with a TriggerDagRunOperator (example here).
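Putting those three pieces together, here is a rough sketch of what such a self-retriggering DAG could look like. The directory path, the slicing of files across workers and the service call are placeholder assumptions, not something from the question:

import glob
from datetime import datetime

from airflow import DAG
from airflow.operators.python import BranchPythonOperator, PythonOperator
from airflow.operators.trigger_dagrun import TriggerDagRunOperator

PENDING_DIR = "/data/excel/pending"  # placeholder location of incoming files
PARALLELISM = 20                     # "X" parallel processing tasks

def check_pending():
    # Branch: run the workers if there are pending files, otherwise stop.
    files = glob.glob(f"{PENDING_DIR}/*.xlsx")
    if not files:
        return "no_files"
    return [f"process_batch_{i}" for i in range(PARALLELISM)]

def process_slice(slice_index):
    # Each worker takes every PARALLELISM-th file, calls the external
    # service to add the sheet, then converts the result to PDF.
    files = sorted(glob.glob(f"{PENDING_DIR}/*.xlsx"))[slice_index::PARALLELISM]
    for path in files:
        print(f"would call the service for {path}, then convert it to PDF")

with DAG(
    dag_id="excel_to_pdf",
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,   # triggered manually or by itself
    catchup=False,
) as dag:
    check = BranchPythonOperator(task_id="check_pending",
                                 python_callable=check_pending)
    no_files = PythonOperator(task_id="no_files",
                              python_callable=lambda: print("nothing to do"))
    retrigger = TriggerDagRunOperator(task_id="retrigger_self",
                                      trigger_dag_id="excel_to_pdf")

    for i in range(PARALLELISM):
        worker = PythonOperator(task_id=f"process_batch_{i}",
                                python_callable=process_slice,
                                op_kwargs={"slice_index": i})
        check >> worker >> retrigger

    check >> no_files

With the default trigger rule, retrigger_self only runs when the worker tasks actually ran (skips propagate), so the loop stops once check_pending finds no more files.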

Airflow dynamic dag creation

Can someone please tell me whether a DAG in Airflow is just a graph (like a placeholder) without any actual data (like arguments) associated with it, or whether a DAG is like an instance (for a fixed set of arguments)?
I want a system where the set of operations to perform (given a set of arguments) is fixed, but the input will be different every time the set of operations is run. In simple terms, the pipeline is the same, but the arguments to the pipeline will be different every time it is run.
I want to know how to configure this in Airflow. Should I create a new DAG for every new set of arguments, or is there another method?
In my case, the graph is the same, but I want to run it on different data (from different users) as it comes in. So, should I create a new DAG every time for new data?
Yes, you are correct; a DAG is basically a kind of one-way (directed, acyclic) graph. You can create a DAG once by chaining multiple operators together to form your "structure".
Each operator can then take in multiple arguments that you can pass from the DAG definition file itself (if needed).
Or you can pass a configuration object to the DAG and access that custom data from the context.
I would recommend reading the Airflow Docs for more examples: https://airflow.apache.org/concepts.html#tasks
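As a minimal sketch of that configuration-object approach, assuming Airflow 2 (the dag_id and the "user_id" key are made up for illustration):

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def process(**context):
    # Per-run arguments arrive in dag_run.conf when the DAG is triggered.
    conf = context["dag_run"].conf or {}
    print(f"running the pipeline for user {conf.get('user_id')}")

with DAG(
    dag_id="per_user_pipeline",
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    PythonOperator(task_id="process", python_callable=process)

A run with different data is then just: airflow dags trigger per_user_pipeline --conf '{"user_id": 42}'.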
You can think of an Airflow DAG as a program made of other programs, with the exception that it can't contain loops (acyclic). Will you change your program every time the input data changes? Of course, it all depends on how you write your program, but usually you'd like your program to generalise, right? You don't want two different programs to do 2+2 and 3+3. But you will have different programs to show Facebook pages and to play Pokemon Go. If you want to do the same thing to similar data, then you want to write your DAG once and maybe only change environment arguments (DB connection, date, etc.). Airflow is perfectly suitable for that.
You do not need to create a new DAG every time, if the structure of the graph is the same.
Airflow DAGs are created via code, so you are free to create a code structure that allows you to pass in arguments each time. How you do that will require some creative thinking.
You could, for example, create a web form that accepts the arguments, stores them in a DB and then schedules the DAG via the Airflow REST API. The DAG code would then need to be written to retrieve the params from the database.
There are several other ways to accomplish what you are asking, they all just depend on your use case.
One caveat: the Airflow scheduler does not perform well if you change the start date of the DAG. For your idea above, you will need to set the start date earlier than your first DAG run and then disable the schedule interval (set it to None). This way you have a start date that doesn't change and dynamically triggered DAG runs.
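For the REST-API part of that idea, a hedged sketch against the Airflow 2 stable API (the base URL, credentials, dag_id and payload are placeholders, and an API auth backend such as basic auth must be enabled):

import requests

# POST /api/v1/dags/{dag_id}/dagRuns triggers a new run; "conf" carries the
# per-run parameters that the DAG reads back at runtime.
response = requests.post(
    "http://localhost:8080/api/v1/dags/per_user_pipeline/dagRuns",
    auth=("admin", "admin"),            # assumes the basic-auth backend is enabled
    json={"conf": {"user_id": 42}},
)
response.raise_for_status()
print(response.json())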

Best way to copy tasks from one DAG into another?

Say I have two pre-existing DAGs, A and B. Is it possible in Airflow to "copy" all tasks from B into A, preserving dependencies and propagating default arguments and the like for all of B's tasks? With the end goal being to have a new DAG A' that contains all of the tasks of both A and B.
I understand that it might not be possible or feasible to reconcile DAG-level factors, e.g. scheduling, and propagate across the copying, but is it possible to at least preserve the dependency orderings of the tasks such that each task runs as expected, when expected—just in a different DAG?
If it is possible, what would be the best way to do so? If it's not supported, is there work in progress to support this sort of "native DAG composition"?
UPDATE-1
Based on the clarification of the question, I infer that the requirement is not to replicate a DAG into another, but to append one DAG after another.
The techniques mentioned in the original answer below are still applicable (to a varying extent).
But for this specific use case there are a few more options:
i. Use TriggerDagRunOperator: invoke your 2nd DAG at the end of your 1st DAG (sketched below).
ii. Use SubDagOperator: wrap your 2nd DAG into a SubDAG and attach it at the end of your 1st DAG.
But do check out the Wiring top-level DAGs together thread (question / answer plus comments) for ideas and caveats of each of the above techniques.
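As a minimal sketch of option (i), assuming Airflow 2 import paths (the DAG ids and the final task are placeholders):

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.operators.trigger_dagrun import TriggerDagRunOperator

with DAG(
    dag_id="dag_a",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag_a:
    last_task_of_a = PythonOperator(task_id="last_task_of_a",
                                    python_callable=lambda: print("A is done"))
    trigger_b = TriggerDagRunOperator(
        task_id="trigger_dag_b",
        trigger_dag_id="dag_b",         # B keeps its own definition file
        wait_for_completion=False,      # set True to block until B finishes
    )
    last_task_of_a >> trigger_b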
ORIGINAL ANSWER
I can think of 3 possible ways
The recommended way would be to programmatically construct your DAG. In other words, if possible, iterate over a list of configs (each config for one task) read from an external source (such as an Airflow Variable, a database or JSON files) and build your DAG as per your business logic. Then you only have to alter the dag_id, and you can re-use the same script to build a DAG identical to your original one (a small sketch of this follows below).
A modification of the 1st approach is to generalize your DAG-construction logic by employing a simple tool like ajbosco/dag-factory or a full-fledged wrapper framework like etsy/boundary-layer.
Finally, if none of the above approaches is easily adaptable for you, you can hand-code the task-replication logic to regenerate the same structure as your original DAG. You can write a single robust script and re-use it across your entire project to replicate DAGs as and when needed. Here you'll have to do DAG traversal and some traditional data-structure and algorithm work. Here's an example of a BFS-like traversal over the tasks of an Airflow DAG.
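For approach 1, here is a small illustrative sketch of building structurally identical DAGs from one shared task-config list; the task ids, commands and config format are all made up:

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# One config entry per task; "upstream" lists the task ids it depends on.
TASK_CONFIGS = [
    {"task_id": "extract", "bash_command": "echo extract", "upstream": []},
    {"task_id": "transform", "bash_command": "echo transform", "upstream": ["extract"]},
    {"task_id": "load", "bash_command": "echo load", "upstream": ["transform"]},
]

def build_dag(dag_id):
    with DAG(dag_id=dag_id, start_date=datetime(2023, 1, 1),
             schedule_interval=None, catchup=False) as dag:
        tasks = {
            cfg["task_id"]: BashOperator(task_id=cfg["task_id"],
                                         bash_command=cfg["bash_command"])
            for cfg in TASK_CONFIGS
        }
        for cfg in TASK_CONFIGS:
            for upstream_id in cfg["upstream"]:
                tasks[upstream_id] >> tasks[cfg["task_id"]]
    return dag

# The same config yields structurally identical DAGs under different dag_ids.
original = build_dag("dag_original")
replica = build_dag("dag_replica")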
