Airflow - Questions on batch jobs and running a task in a DagRun multiple times

I am trying to solve the following problem with airflow:
I have a data pipeline where I want to run several processes on a number of Excel documents (e.g. 5,000 Excel files a day). My idea for a DAG is below:
Task 1 = Take an Excel file and add a new sheet to it.
Task 2 = Convert the returned Excel file to a PDF.
Tasks 1 and 2 in the DAG would call a processing tool running outside Airflow via an API call (so the actual data processing isn't happening inside Airflow).
I seem to be going around in circles with figuring out the best approach to this workflow. Some questions I keep having are:
1. Should each DagRun be one Excel file, or should the DagRun take in a batch of Excel files?
2. If taking in a batch (which I presume is the correct approach), what is the recommended batch size?
3. How would I pass the returned values from Task 1 to Task 2? Would it be an XCom dictionary with a reference to each newly saved Excel file? I read somewhere that the max size of an XCom should be 48 KB, so an XCom holding 5,000 Excel file paths will probably be larger than that.
4. The last and trickiest question: I would obviously want to start Task 2 as soon as even one Excel file from Task 1 has completed, because I wouldn't want to wait for the entire batch of Task 1 to complete before starting Task 2. How can I run Task 2 multiple times within the same DagRun, once for each new result that Task 1 produces? Or should Task 2 be its own DAG?
Am I approaching this problem the right way? How should I be tackling this problem?

Assumptions
I made some assumptions since I don't know all the details of the Excel file processing:
You cannot merge the Excel files since you need them separate.
Excel files are accessible from Airflow DAG (same filesystem or similar).
If any of these assumptions does not hold, please clarify accordingly.
Answers
That being said, I'll first answer your questions and then comment on some thoughts:
1. I think you can work in batches, since one run per file would be very slow (mostly because of scheduler overhead, which adds time between the processing of Excel files). You would also not be using all the available resources, so it is better to keep Airflow busier.
2. The batch size will depend on the processing load and the task design. From your question I assume you're thinking about batching inside a task, but if the service that processes the Excel files handles parallelism well, I'd rather recommend one task per Excel file. That said, having 5,000 tasks (one for each file) would be a bad idea (because that would be difficult to see in the UI), and the exact number of files per batch depends mostly on your resources and the service SLA.
3. From my experience I recommend using one task for everything: you can call the service in parallel and, right after the service completes, convert the Excel file to PDF directly.
4. This is solved by the answer to question 3.
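The "one task for everything" idea from the third answer can be sketched in plain Python as follows. `add_sheet` and `convert_to_pdf` are hypothetical stand-ins for the two API calls to the external processing tool, and the worker count is an arbitrary assumption. Because each file runs through both steps inside one worker, the PDF conversion of a file starts as soon as its own first step finishes, without waiting for the rest of the batch.

```python
from concurrent.futures import ThreadPoolExecutor


def add_sheet(excel_path: str) -> str:
    """Hypothetical stand-in for the API call that adds a sheet."""
    return excel_path.replace(".xlsx", "_with_sheet.xlsx")


def convert_to_pdf(excel_path: str) -> str:
    """Hypothetical stand-in for the API call that converts to PDF."""
    return excel_path.replace(".xlsx", ".pdf")


def process_one(excel_path: str) -> str:
    # Both steps run back to back for a single file.
    return convert_to_pdf(add_sheet(excel_path))


def process_batch(excel_paths: list[str], max_workers: int = 8) -> list[str]:
    # Files are processed in parallel; results keep the input order.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(process_one, excel_paths))
```

In an Airflow setup, something like `process_batch` would be the body of a single task that receives one batch of file paths.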
Solution overview
The solution I imagine is something like:
1. A first task checks whether there are pending files. You can fork here using a BranchPythonOperator.
2. Then you have X parallel tasks that each process Excel files (call the service) and convert the results to PDF. This could be a single PythonOperator task. If you use Airflow 2, you can use the @task decorator to simplify the code. X could be from 10 to 100, for example, depending on the resources and the service throughput.
3. A final task triggers the DAG again to process more files. This could be implemented using a TriggerDagRunOperator.
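One way to feed the X parallel tasks is to split the pending files into X slices, one per task. Below is a minimal sketch in plain Python; the function name and the round-robin strategy are assumptions, not something the answer prescribes.

```python
def split_into_batches(items: list[str], num_batches: int) -> list[list[str]]:
    """Distribute items round-robin over num_batches lists."""
    if num_batches < 1:
        raise ValueError("num_batches must be at least 1")
    return [items[i::num_batches] for i in range(num_batches)]


# Hypothetical list of pending files found by the first task.
pending = [f"file_{n}.xlsx" for n in range(10)]
batches = split_into_batches(pending, 3)
# Each of the X parallel tasks would pick up one of these lists.
```

Round-robin keeps the slices within one item of each other in size, which matters when the last run of the day has only a few files left.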

Related

Best way to organize these tasks into a DAG

I am new to Airflow. I took some courses about it but did not come across any example for my use case. I would like to:
Fetch data from Postgres (the data is usually 1M+ rows, which I assume is too large for XComs)
(In some cases process the data if needed, but this can usually be done inside the query itself)
Insert the data into Oracle
I tend to see workflows that export the data to a CSV first (from Postgres) and then load it into the destination database. However, I feel it would be best to do all three steps in a single Python operator (for example, looping with a cursor and bulk inserting), but I am not sure whether this is suitable for Airflow.
Any ideas on possible solutions to this situation? What is the general approach?
As you mentioned there are several options.
To name a few:
Doing everything in one Python task.
Creating a custom operator.
Creating a pipeline that works with files.
All approaches are valid; each one has advantages and disadvantages.
First approach:
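You can write a Python function that uses PostgresHook to create a DataFrame and then load it into Oracle.
    # Import paths below are the Airflow 2 ones
    from airflow.operators.python import PythonOperator
    from airflow.providers.postgres.hooks.postgres import PostgresHook

    def all_in_one(**context):
        # Read the whole result set into memory as a pandas DataFrame
        pg_hook = PostgresHook(postgres_conn_id='postgres_conn')
        df = pg_hook.get_pandas_df('SELECT * FROM table')
        # do some transformation on df as needed and load to Oracle

    op = PythonOperator(
        task_id='all_in_one_task',
        python_callable=all_in_one,
        dag=dag,  # assumes a DAG object named dag is defined elsewhere
    )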
Advantages :
Easy coding (for people who are used to writing Python scripts).
Disadvantages:
Not suitable for large transfers, as everything is held in memory.
If you need to backfill or rerun, the entire function is executed. So if there is an issue with loading to Oracle, you will still rerun the code that fetches records from PostgreSQL.
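The in-memory limitation can be reduced by streaming rows in chunks, as the question suggests (looping with a cursor and bulk inserting). The sketch below uses two in-memory `sqlite3` databases as stand-ins for PostgreSQL and Oracle so that it runs anywhere; the table name, columns, and chunk size are made up for illustration. With real drivers the same DB-API pattern (`fetchmany` plus `executemany`) applies, though placeholder syntax differs per driver.

```python
import sqlite3


def copy_in_chunks(src: sqlite3.Connection, dst: sqlite3.Connection,
                   chunk_size: int = 1000) -> int:
    """Copy rows from src to dst without loading them all into memory."""
    cursor = src.execute("SELECT id, name FROM items ORDER BY id")
    copied = 0
    while True:
        rows = cursor.fetchmany(chunk_size)
        if not rows:
            break
        dst.executemany("INSERT INTO items (id, name) VALUES (?, ?)", rows)
        dst.commit()  # one commit per chunk
        copied += len(rows)
    return copied


# Stand-in databases; a real pipeline would connect to PostgreSQL and Oracle.
source = sqlite3.connect(":memory:")
target = sqlite3.connect(":memory:")
for conn in (source, target):
    conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT)")
source.executemany("INSERT INTO items VALUES (?, ?)",
                   [(i, f"row-{i}") for i in range(2500)])
source.commit()

total = copy_in_chunks(source, target, chunk_size=1000)
```

This keeps memory bounded by the chunk size, but it does not remove the second disadvantage: a rerun still starts again from the source query.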
Second approach:
You can implement your own operator, for example MyPostgresqlToOracleTransfer, with any logic you wish. This is useful if you want to reuse the functionality in different DAGs.
Third approach:
Work with files (data-lake style).
The file can be on the local machine if you have only one worker. If not, the file must be uploaded to a shared drive (S3, Google Storage, or any other disk that can be accessed by all workers).
A possible pipeline can be:
PostgreSQLToGcs -> GcsToOracle
Depending on what service you are using, some of the required operators may already be implemented by Airflow.
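The two-step file pipeline can be sketched in plain Python. This is not the Airflow operators named above: `export_to_csv` and `load_from_csv` are hypothetical functions, `sqlite3` stands in for both databases, and a temporary directory stands in for the shared drive. The point is that the only thing handed from the first step to the second is a file path.

```python
import csv
import os
import sqlite3
import tempfile


def export_to_csv(src: sqlite3.Connection, path: str) -> str:
    """Step 1: dump the source table to a file and return its path."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerows(src.execute("SELECT id, name FROM items ORDER BY id"))
    return path


def load_from_csv(dst: sqlite3.Connection, path: str) -> int:
    """Step 2: load the exported file; can be retried without re-exporting."""
    with open(path, newline="") as handle:
        rows = ((int(row[0]), row[1]) for row in csv.reader(handle))
        cursor = dst.executemany(
            "INSERT INTO items (id, name) VALUES (?, ?)", rows)
    dst.commit()
    return cursor.rowcount


source = sqlite3.connect(":memory:")
target = sqlite3.connect(":memory:")
for conn in (source, target):
    conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT)")
source.executemany("INSERT INTO items VALUES (?, ?)", [(1, "a"), (2, "b")])

with tempfile.TemporaryDirectory() as shared_dir:
    exported = export_to_csv(source, os.path.join(shared_dir, "items.csv"))
    loaded = load_from_csv(target, exported)
```

If `load_from_csv` fails, only that step needs to be repeated, since the exported file is still in place.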
Advantages :
Each task stands on its own, so if you successfully exported the data to disk, then in the event of a backfill or failure you can rerun just the failed operators and not the whole pipeline. You can also save the exported files in cold storage in case you need to rebuild from history.
Suitable for large transfers.
Disadvantages:
Adding another service which is "not needed" (shared disk resource)
Summary
I prefer the 2nd/3rd approaches. I think they are better suited to what Airflow provides and allow more flexibility.

Why is airflow not a data streaming solution?

Learning about Airflow and want to understand why it is not a data streaming solution.
https://airflow.apache.org/docs/apache-airflow/1.10.1/#beyond-the-horizon
Airflow is not a data streaming solution. Tasks do not move data from one to the other (though tasks can exchange metadata!)
I'm not sure what it means that tasks do not move data from one to another. So I can't have tasks like Extract Data >> Calculate A >> Calculate B using data from prev step >> another step that depends on prev task's result >> ... ?
Furthermore, I read the following a few times and it still doesn't click with me. What's an example of a static vs. dynamic workflow?
Workflows are expected to be mostly static or slowly changing. You can think of the structure of the tasks in your workflow as slightly more dynamic than a database structure would be. Airflow workflows are expected to look similar from a run to the next, this allows for clarity around unit of work and continuity.
Can someone help provide an alternative explanation or example that can walk through why Airflow is not a good data streaming solution?
I'm not sure what it means that tasks do not move data from one to another. So I can't have tasks like Extract Data >> Calculate A >> Calculate B using data from prev step >> another step that depends on prev task's result >> ... ?
You can use XCom (cross-communication) to use data from a previous step, BUT bear in mind that XCom is stored in the metadata DB, meaning that you should not use XCom to hold the data being processed. Use XCom to hold a reference to the data (e.g. an S3/HDFS/GCS path, a BigQuery table, etc.).
So instead of passing data between tasks, in Airflow you pass a reference to the data between tasks using XCom. Airflow is an orchestrator; heavy tasks should be off-loaded to another data processing cluster (e.g. a Spark cluster, BigQuery, etc.).
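To make the size argument concrete, here is a plain-Python sketch (no Airflow involved) that compares serializing 5,000 file paths directly with writing them to a manifest file and passing only the manifest's path. The paths, the `write_manifest` and `read_manifest` helpers, and the temporary directory standing in for shared storage are all hypothetical.

```python
import json
import os
import tempfile


def write_manifest(paths: list[str], directory: str) -> str:
    """Store the full list in a file and return a small reference to it."""
    manifest_path = os.path.join(directory, "manifest.json")
    with open(manifest_path, "w") as handle:
        json.dump(paths, handle)
    return manifest_path


def read_manifest(manifest_path: str) -> list[str]:
    """Downstream step: resolve the reference back into the full list."""
    with open(manifest_path) as handle:
        return json.load(handle)


paths = [f"/data/excel/file_{n:05d}.xlsx" for n in range(5000)]
# Size of the payload if the whole list were passed between tasks.
payload_size = len(json.dumps(paths).encode("utf-8"))

with tempfile.TemporaryDirectory() as shared_dir:
    reference = write_manifest(paths, shared_dir)
    # Size of what would actually be passed: just the manifest path.
    reference_size = len(reference.encode("utf-8"))
    restored = read_manifest(reference)
```

The reference stays the same size no matter how many files the manifest lists, which is what makes it a good fit for XCom.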

Airflow dynamic dag creation

Someone please tell me whether a DAG in airflow is just a graph (like a placeholder) without any actual data (like arguments) associated with it OR a DAG is like an instance (for a fixed argument)?
I want a system where the set of operations to perform (given a set of arguments) is fixed, but the input is different every time the operations are run. In simple terms, the pipeline is the same but the arguments to the pipeline are different every time it runs.
How do I configure this in Airflow? Should I create a new DAG for every new set of arguments, or is there another method?
In my case, the graph is the same but I want to run it on different data (from different users) as it arrives. So, should I create a new DAG every time for new data?
Yes, you are correct: a DAG is basically a kind of one-way graph. You create a DAG once by chaining multiple operators together to form your "structure".
Each operator, can then take in multiple arguments that you can pass from the DAG definition file itself (if needed).
Or you can pass in a configuration object to the DAG, and access custom data from there using the context.
I would recommend reading the Airflow Docs for more examples: https://airflow.apache.org/concepts.html#tasks
You can think of an Airflow DAG as a program made of other programs, with the exception that it can't contain loops (it is acyclic). Will you change your program every time the input data changes? Of course, it all depends on how you write your program, but usually you'd like your program to generalise, right? You don't want two different programs to do 2+2 and 3+3. But you'll have different programs to show Facebook pages and to play Pokemon Go. If you want to do the same thing to similar data, then you want to write your DAG once and maybe only change environment arguments (DB connection, date, etc.). Airflow is perfectly suitable for that.
You do not need to create a new DAG every time, if the structure of the graph is the same.
Airflow DAGs are created via code, so you are free to create a code structure that allows you to pass in arguments each time. How you do that will require some creative thinking.
You could, for example, create a web form that accepts the arguments, stores them in a DB, and then triggers the DAG through the Airflow REST API. The DAG code would then need to be written to retrieve the parameters from the database.
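As a sketch of the triggering side, the request below targets the stable REST API of Airflow 2 (`POST /api/v1/dags/{dag_id}/dagRuns`), whose JSON body accepts a `conf` object. The base URL, DAG id, and arguments are hypothetical, authentication is left out because it depends on how the deployment is configured, and the request is only built here, not sent.

```python
import json
import urllib.request


def build_trigger_request(base_url: str, dag_id: str,
                          conf: dict) -> urllib.request.Request:
    """Build (but do not send) a request that starts one DAG run with arguments."""
    url = f"{base_url.rstrip('/')}/api/v1/dags/{dag_id}/dagRuns"
    body = json.dumps({"conf": conf}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_trigger_request(
    "http://localhost:8080",  # hypothetical webserver address
    "user_pipeline",          # hypothetical DAG id
    {"user_id": 42, "input_path": "s3://bucket/user-42/input.csv"},
)
# urllib.request.urlopen(request) would send it once credentials are added.
```

Arguments passed this way travel with the run itself and can be read from the run's configuration via the task context, which can replace the database lookup for small parameter sets.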
There are several other ways to accomplish what you are asking, they all just depend on your use case.
One caveat: the Airflow scheduler does not perform well if you change the start date of the DAG. For your idea above you will need to set the start date earlier than your first DAG run and then set the schedule interval to None. This way you have a start date that doesn't change and dynamically triggered DAG runs.

Is Airflow a good fit for DAG that doesn’t care about execution date/time?

The API in Airflow seems to suggest it is built around backfilling, catching up, and scheduling runs at regular intervals.
I have an ETL that extracts data on S3 together with the version of the previous node in the DAG (the node the data comes from). For example, here are the nodes of the DAG:
ImageNet-mono
ImageNet-removed-red
ImageNet-mono-scaled-to-100x100
ImageNet-removed-red-scaled-to-100x100
where ImageNet-mono is the previous node of ImageNet-mono-scaled-to-100x100 and
where ImageNet-removed-red is the previous node of ImageNet-removed-red-scaled-to-100x100
Both of them go through transformation of scaled-to-100x100 pipeline but producing different data since the input is different.
As you can see, no date is involved. Is Airflow a good fit?
EDIT
Currently, the graph is simple enough to be managed manually, with fewer than 10 nodes. They won't run at regular intervals. Instead, as soon as someone updates the code for a node, I have to run the downstream nodes manually one by one: python GetImageNet.py removed-red, then python scale.py 100 100 ImageNet-removed-red, and then python scale.py 100 100 ImageNet-mono. I am looking for a way to manage the graph so that a run can be triggered with one click.
I think it's fine to use Airflow as long as you find it useful to use the DAG representation. If your DAG does not need to be executed on a regular schedule, you can set the schedule to None instead of a crontab. You can then trigger your DAG via the API or manually via the web interface.
If you want to run specific tasks you can trigger your DAG and mark tasks as success or clear them using the web interface.
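To illustrate what the DAG representation provides here, below is a plain-Python sketch (not Airflow code) using the standard library's `graphlib` (Python 3.9+). Given the dependencies from the question, it lists what has to be rerun, in order, after one node changes. The `downstream_of` helper is hypothetical.

```python
from graphlib import TopologicalSorter

# node -> set of nodes it depends on (its "previous nodes")
dependencies = {
    "ImageNet-mono": set(),
    "ImageNet-removed-red": set(),
    "ImageNet-mono-scaled-to-100x100": {"ImageNet-mono"},
    "ImageNet-removed-red-scaled-to-100x100": {"ImageNet-removed-red"},
}


def downstream_of(changed: str, deps: dict[str, set[str]]) -> list[str]:
    """Return the changed node and everything that depends on it, in run order."""
    affected = {changed}
    order = list(TopologicalSorter(deps).static_order())
    for node in order:  # predecessors always come before their dependents
        if deps.get(node, set()) & affected:
            affected.add(node)
    return [node for node in order if node in affected]


to_rerun = downstream_of("ImageNet-removed-red", dependencies)
```

Airflow does this bookkeeping for you: clearing a task together with its downstream tasks in the web interface has the same effect as rerunning the nodes in `to_rerun`.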

synchronize multiple map reduce jobs in hadoop

I have a use case where multiple jobs can run at the same time. The output of all the jobs has to be merged into a common master file in HDFS (containing key-value pairs) that has no duplicates. I'm not sure how to avoid the race condition that could crop up in this case. As an example, Job 1 and Job 2 could both write the same value to the master file simultaneously, resulting in duplicates. I'd appreciate your help with this.
Apache Hadoop doesn't support parallel writing to the same file. Here is the reference:
Files in HDFS are write-once and have strictly one writer at any time.
So, multiple maps/jobs can't write to the same file simultaneously. Another job/shell or any other program has to be written to merge the output of multiple jobs.
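The separate merge step could look roughly like this in plain Python. Local files in a temporary directory stand in for HDFS paths, the tab-separated key/value format is an assumption, and `merge_outputs` is a hypothetical helper. At real scale this step would itself be a job on the cluster; the sketch only shows the single-writer idea, where one process alone writes the master file after the parallel jobs have finished.

```python
import os
import tempfile


def merge_outputs(master_path: str, job_output_paths: list[str]) -> int:
    """Merge job outputs into the master file, keeping one value per key."""
    merged: dict[str, str] = {}
    existing = [master_path] if os.path.exists(master_path) else []
    for path in existing + job_output_paths:
        with open(path) as handle:
            for line in handle:
                key, value = line.rstrip("\n").split("\t", 1)
                merged.setdefault(key, value)  # first occurrence wins
    with open(master_path, "w") as handle:
        for key in sorted(merged):
            handle.write(f"{key}\t{merged[key]}\n")
    return len(merged)


with tempfile.TemporaryDirectory() as workdir:
    job1 = os.path.join(workdir, "job1.tsv")
    job2 = os.path.join(workdir, "job2.tsv")
    master = os.path.join(workdir, "master.tsv")
    with open(job1, "w") as handle:
        handle.write("a\t1\nb\t2\n")
    with open(job2, "w") as handle:
        handle.write("b\t2\nc\t3\n")
    unique_keys = merge_outputs(master, [job1, job2])
    with open(master) as handle:
        master_lines = handle.read().splitlines()
```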
