Airflow: create sub-processes based on number of files

A newbie question about Airflow:
I have a list of 100+ servers in a text file. Currently, a Python script is used to log in to each server, read a file, and write the output, and it takes a long time to finish. If this job is converted to an Airflow DAG, is it possible to split the servers into multiple groups and start a new task for each group using some operator? Or can this only be achieved by modifying the Python script (e.g. using async) and running it with the PythonOperator? Seeking advice/best practices. I tried searching for examples but was not able to find one. Thanks!

Airflow is not really a "map-reduce" type of framework (which is what you seem to be trying to implement). Airflow tasks are not (at least currently) designed to split work between them; it is very atypical for an Airflow DAG to have N tasks that each do the same thing on a different subset of the data. Airflow is more for orchestrating logic: each task conceptually does a different thing, and there are rarely cases where N parallel tasks do the same work on different slices of data. More often than not, Airflow "tasks" do not "do" the job themselves; they tell other systems what to do and wait until that work is done.
Typically, Airflow is used to orchestrate services that excel at this kind of processing - for example, a Hadoop job that handles such "parallel", map-reduce style work using other tools. You could also - as you mentioned - use async, multi-threading or even multi-processing inside a PythonOperator, but at some scale dedicated tools are usually much easier to use and get more value out of (with more efficient utilization of parallelism, for example).
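For what it's worth, here is a minimal sketch of the "modify the Python script and run it with the PythonOperator" option (my own illustration, not code from the question or answer): the server list is split into a handful of groups, each group becomes one task, and each task uses threads for the I/O-bound logins. The file path, group count and fetch_output() helper are hypothetical placeholders.

from concurrent.futures import ThreadPoolExecutor
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def fetch_output(server):
    """Hypothetical stand-in for 'log in, read a file, write the output'."""
    ...


def process_group(servers):
    # The work is I/O-bound, so threads (or asyncio) give most of the speed-up.
    with ThreadPoolExecutor(max_workers=10) as pool:
        list(pool.map(fetch_output, servers))


with DAG("fetch_server_outputs", start_date=datetime(2021, 1, 1),
         schedule_interval="@daily", catchup=False) as dag:
    # Reading the list at parse time keeps the sketch short; in practice you
    # may prefer to read it inside the task.
    with open("/path/to/servers.txt") as f:      # hypothetical path
        servers = [line.strip() for line in f if line.strip()]

    n_groups = 5
    for i in range(n_groups):
        PythonOperator(
            task_id=f"process_group_{i}",
            python_callable=process_group,
            op_args=[servers[i::n_groups]],      # every n-th server
        )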

Related

How is Flyte tailored to "Data and Machine Learning"?

https://flyte.org/ says that it is
The Workflow Automation Platform for Complex, Mission-Critical Data and Machine Learning Processes at Scale
I went through quite a bit of documentation and I fail to see why it is for "Data and Machine Learning". It seems to me that it is a workflow manager on top of a container orchestrator (here Kubernetes), where "workflow manager" means that I can define a Directed Acyclic Graph (DAG), the DAG nodes are deployed as containers, and the DAG is run.
Of course this is useful and important for "Data and Machine Learning", but I might as well use it for any other microservice DAG. Features and details aside, how is this different from https://airflow.apache.org or other workflow managers (of which there are many)? There are even more specialized workflow managers for "Data and Machine Learning", e.g. https://spark.apache.org.
What should I keep in mind as a Software Architect?
That is a great question. You are right about one thing: at its core, it is a serverless workflow orchestrator (serverless because it brings up the infrastructure to run the code). And yes, it can be used in many other situations, though it may not be the best tool for some of them, such as micro-service orchestration.
But what really makes it good for ML & data orchestration is a combination of:
Features (list below)
Integrations (list below)
Community of folks using it
Roadmap
Features
Long-running tasks: it is designed for extremely long-running tasks - tasks that can run for days or weeks. Even if the control plane goes down, you will not lose the work, and you can keep deploying without impacting existing work.
Versioning - allows multiple users to work independently on the same workflow, using different libraries, models, inputs, etc.
Memoization. Take a pipeline with 10 steps: you can memoize the first 9, and if the 10th fails (or you modify the 10th) it will reuse the results from the previous 9. This leads to drastically faster iteration.
Strong typing and ML-specific type support
Flyte understands dataframes and can translate between spark.DataFrame -> pandas.DataFrame -> Modin -> Polars, etc., without the user having to think about how to do it efficiently. It also supports things like tensors (correctly serialized), numpy arrays, etc. Models can be saved and retrieved from past executions, so it is in fact the model truth store.
Native support for intra-task checkpointing. This can help in recovering model training after node failures, and even across executions, with new support being added for checkpointing callbacks.
Flyte Decks: a way to visualize metrics like ROC curves, or automatically visualize the distribution of the data input to a task.
Extensible programming interface that can orchestrate distributed jobs or run locally - e.g. Spark, MPI, SageMaker
Reference task for library isolation
Scheduler independent of user code
Understanding of resources like GPUs - automatically schedule on GPUs and/or spot machines, with smart handling of spot instances: the first n-1 retries run on spot, and the last retry is automatically moved to an on-demand machine for better guarantees
Map tasks and dynamic tasks: map over a list (e.g. a list of regions); dynamic tasks create new static graphs based on inputs dynamically (a minimal sketch follows this list)
Multiple launch plans. Schedule two runs of a workflow with slightly different hyperparameters or model values, etc.
For Admins
For really long-running tasks, admins can redeploy the management layer without killing the tasks
Support for spot/arm/gpu (with different versions etc)
Quotas and throttles per project/domain
Upgrade infra without upgrading user libraries
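To make the memoization and map-task items above concrete, here is a minimal flytekit sketch (my own illustration, not code from the answer). The task and workflow names, the per-region work, and the cache_version string are all hypothetical.

from typing import List

from flytekit import map_task, task, workflow


@task(cache=True, cache_version="1.0")   # memoized: re-runs reuse prior results
def process_region(region: str) -> int:
    # hypothetical per-region work
    return len(region)


@task
def combine(counts: List[int]) -> int:
    return sum(counts)


@workflow
def regions_wf(regions: List[str]) -> int:
    # map task: fan the same task out over a list of inputs
    counts = map_task(process_region)(region=regions)
    return combine(counts=counts)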
Integrations
pandas dataframe native support
Spark
mpi jobs (gang scheduled)
pandera / Great expectations for data quality
Sagemaker
Easy deployment of model for serving
Polars / Modin / Spark dataframe
tensors / checkpointing etc
etc and many others in the roadmap
Community
Focused on ML specific features
Roadmap
CD4ML, with human-in-the-loop and external-signal-based workflows. This will allow users to automate deployment of models or perform human-in-the-loop labeling, etc.
Support for Ray/Spark/Dask cluster re-use across tasks
Integration with WhyLogs and other tools for monitoring
Integration with MLFlow etc
More native Flytedecks renderers
Hopefully this answers your questions. Please also join the Slack community, ask more questions, and help spread this information.

Generic Airflow data staging operator

I am trying to understand how to manage large data with Airflow. The documentation is clear that we need to use external storage, rather than XCom, but I can not find any clean examples of staging data to and from a worker node.
My expectation would be that there should be an operator that can run a staging in operation, run the main operation, and staging out again.
Is there such an operator or pattern? The closest I've found is the S3 File Transform operator, but it runs an executable to do the transform, rather than a generic operator such as the DockerOperator, which is what we'd want to use.
Other "solutions" I've seen rely on everything running on a single host, and using known paths, which isn't a production ready solution.
Is there such an operator that would support data staging, or is there a concrete example of handling large data with Airflow that doesn't rely on each operator being equipped with cloud copying capabilities?
Yes and no. Traditionally, Airflow is mostly an orchestrator - it does not usually "do" the work itself, it tells others what to do. You very rarely need to bring actual data to an Airflow worker; the worker is mostly there to tell other systems where the data is coming from, what to do with it, and where to send it.
There are exceptions (some transfer operators actually download data from one service and upload it to another), so the data passes through the Airflow node, but this is an exception rather than the rule (the more efficient and better pattern is to invoke an external service to do the transfer and have a sensor wait until it completes).
This is the more "historical" (and somewhat "current") way Airflow operates. However, with Airflow 2 and beyond we are expanding this, and it is becoming more and more possible to implement a pattern similar to what you describe - and this is where XCom plays a big role.
You can - as of recently - develop custom XCom backends that allow for more than metadata sharing; they are also good for sharing the data itself. You can see the docs here https://airflow.apache.org/docs/apache-airflow/stable/concepts/xcoms.html#custom-backends, a nice article from Astronomer about it https://www.astronomer.io/guides/custom-xcom-backends, and a very nice talk from Airflow Summit 2021 (from last week) presenting it: https://airflowsummit.org/sessions/2021/customizing-xcom-to-enhance-data-sharing-between-tasks/ . I highly recommend watching the talk!
Looking at your pattern: the XCom pull is staging-in, the operator's execute() is the operation, and the XCom push is staging-out.
This pattern will, I think, be reinforced by upcoming Airflow releases and some cool integrations that are coming. There will likely be more data-sharing options in the future (but I think they will all be based on a - maybe slightly enhanced - XCom implementation).
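For illustration, here is a minimal sketch of the custom XCom backend idea from the docs and the Astronomer guide linked above (my own sketch, not code from either source): large values are written to S3 and only a small reference string is kept in the Airflow metadata database. The bucket name and key prefix are hypothetical, and the exact serialize_value signature varies slightly between Airflow versions.

import json
import uuid

from airflow.models.xcom import BaseXCom
from airflow.providers.amazon.aws.hooks.s3 import S3Hook


class S3XComBackend(BaseXCom):
    PREFIX = "xcom_s3://"
    BUCKET = "my-xcom-bucket"        # hypothetical bucket

    @staticmethod
    def serialize_value(value):
        # Stage out: push the real payload to S3, store only a reference.
        key = f"xcom/{uuid.uuid4()}.json"
        S3Hook().load_string(json.dumps(value), key=key,
                             bucket_name=S3XComBackend.BUCKET, replace=True)
        return BaseXCom.serialize_value(S3XComBackend.PREFIX + key)

    @staticmethod
    def deserialize_value(result):
        # Stage in: if the stored value is a reference, pull the payload back.
        value = BaseXCom.deserialize_value(result)
        if isinstance(value, str) and value.startswith(S3XComBackend.PREFIX):
            key = value[len(S3XComBackend.PREFIX):]
            return json.loads(S3Hook().read_key(key=key,
                                                bucket_name=S3XComBackend.BUCKET))
        return value

The backend would be enabled by pointing the xcom_backend option in the [core] section of airflow.cfg (or the AIRFLOW__CORE__XCOM_BACKEND environment variable) at this class.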

Snowpipe vs Airflow for continuous data loading into Snowflake

I had a question related to Snowflake. Actually in my current role, I am planning to migrate data from ADLS (Azure data lake) to Snowflake.
I am right now looking at 2 options:
Create a Snowpipe to load updated data
Create an Airflow job for the same.
I am still trying to understand which will be the best way and what the pros and cons of each are.
It depends on what you are trying to do as part of this migration. If it is a plain-vanilla (no transformations, no complex validations), as-is migration of data from ADLS to Snowflake, then you may be fine with Snowpipe (but please also check whether your scenario is better suited to Snowpipe or bulk COPY - https://docs.snowflake.com/en/user-guide/data-load-snowpipe-intro.html#recommended-load-file-size).
If you have many steps before you move the data to Snowflake, and there is a chance that you may need to change your workflow in the future, it is better to use Airflow, which will give you more flexibility. In one of my migrations I used Airflow, and in another CONTROL-M.
You'll be able to load higher volumes of data with lower latency if you use Snowpipe instead of Airflow. It'll also be easier to manage Snowpipe in my opinion.
Airflow is a batch scheduler, and using it to schedule anything that runs more frequently than every 5 minutes becomes painful to manage. Also, you'll have to manage the scaling yourself with Airflow. Snowpipe is a serverless option that can scale up and down based on the volume it sees, and you're going to see your data land within 2 minutes.
The only thing that should restrict your usage of Snowpipe is cost. Although, you may find that Snowpipe ends up being cheaper in the long run if you consider that you'll need someone to manage your Airflow pipelines too.
There are a few considerations. Snowpipe can only run a single COPY command, which has some limitations itself, and Snowpipe imposes further limitations as per its Usage Notes. The main pain is that it does not support PURGE = TRUE | FALSE (i.e. automatic purging while loading), saying only:
"Note that you can manually remove files from an internal (i.e. Snowflake) stage (after they’ve been loaded) using the REMOVE command."
Regrettably, the Snowflake docs are famously vague, written in an ambiguous, colloquial style. While they say you 'can' remove the files manually yourself, in reality any user running Snowpipe as advertised for "continuous fast ingestion" must remove the files, or suffer the performance/cost impact of the COPY command having to ignore a very large number of previously loaded files. The docs on the cost and performance of "table directories" (which are implicit to stages) describe 1M files as a lot of files. By way of an official example, the default pipe flush time of the Snowflake Kafka connector's Snowpipe is 120s, so assuming data ingests continually and you make one file per flush, you will hit 1M files in roughly 2 years. Yet using Snowpipe is supposed to imply low latency: if you lowered the flush to 30s, you might hit the 1M-file mark in about half a year.
If you want a fully automated process with no manual intervention, this means that after you have pushed files into a stage and invoked the pipe, you need logic that polls the API to learn which files were eventually loaded; your logic can then remove the loaded files. The official Snowpipe Java example code has logic that pushes files and then polls the API to check when the files are eventually loaded. The Snowflake Kafka connector also polls to check which files the pipe has eventually, asynchronously, completed. Alternatively, you might write an Airflow job to LIST @the_stage, look for files whose last_modified is older than some safe threshold, and then REMOVE @the_stage/path/file.gz for those older files (a minimal sketch follows).
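Here is a minimal sketch of that Airflow cleanup job (my own illustration, not from the answer). The stage name is taken from the text above, while the threshold, connection id and the exact LIST output format are assumptions you would want to verify against your account.

from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime

from airflow.decorators import task
from airflow.providers.snowflake.hooks.snowflake import SnowflakeHook

STAGE = "@the_stage"              # stage name from the text above
THRESHOLD = timedelta(days=1)     # hypothetical "safe" age


@task
def purge_loaded_files():
    hook = SnowflakeHook(snowflake_conn_id="snowflake_default")
    # LIST returns rows of (name, size, md5, last_modified) for the stage.
    for name, _size, _md5, last_modified in hook.get_records(f"LIST {STAGE}"):
        age = datetime.now(timezone.utc) - parsedate_to_datetime(last_modified)
        if age > THRESHOLD:
            # For a named internal stage the name column already starts with
            # the stage name, so prefixing '@' gives a valid path for REMOVE.
            hook.run(f"REMOVE @{name}")

This task would be wired into a DAG alongside (or downstream of) the load, with the threshold chosen so that only files Snowpipe has certainly finished with are removed.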
The next limitation is that a copy command is a "copy into your_table" command that can only target a single table. You can however do advanced transformations using SQL in the copy command.
Another thing to consider is that neither latency nor throughput is guaranteed with Snowpipe; the documentation very clearly says you should measure the latency yourself. It would be a complete "free lunch" if Snowpipe, which runs on shared infrastructure to reduce your costs, ran instantly and as fast as if you were paying for hot warehouses. It is reasonable to assume a higher tail latency when using shared "on-demand" infrastructure (i.e. a small percentage of invocations will have a high delay).
You have no control over the size of the warehouse used by Snowpipe, which will affect the performance of any SQL transforms used in the COPY command. In contrast, if you run the load from Airflow you assign a warehouse to run the COPY command, and you can assign as big a warehouse as you need for your transforms.
A final consideration is that to use Snowpipe you need to make a Snowflake API call. That is significantly more complex code to write than making a regular database connection to load data into a stage. For example, the regular Snowflake JDBC database connection has advanced methods that make it efficient to stream data into stages without having to write OAuth code to call the Snowflake API.
Be very clear that if you carefully read the Snowpipe documentation you will see that Snowpipe is simply a restricted COPY INTO <table> command, running on shared infrastructure, that is eventually run at some point; whereas you yourself can run a full COPY command as part of a more complex SQL script on a warehouse that you can size and suspend (a sketch of that follows). If you can live with the restrictions of Snowpipe, can figure out how to remove the files in the stage yourself, and can accept that tail latency is likely to be higher (and throughput less predictable) than when paying for a dedicated warehouse, then it could be a good fit.
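For comparison, a minimal sketch of running that COPY yourself from Airflow on a warehouse you control, with PURGE = TRUE so loaded files are cleaned up automatically (my own illustration; the table, stage, file format, warehouse and connection id are hypothetical):

from datetime import datetime

from airflow import DAG
from airflow.providers.snowflake.operators.snowflake import SnowflakeOperator

with DAG("adls_to_snowflake", start_date=datetime(2021, 1, 1),
         schedule_interval="*/30 * * * *", catchup=False) as dag:
    copy_into_target = SnowflakeOperator(
        task_id="copy_into_target",
        snowflake_conn_id="snowflake_default",
        warehouse="LOAD_WH",            # a warehouse you size and suspend yourself
        sql="""
            COPY INTO my_schema.my_table
            FROM @the_stage
            FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1)
            PURGE = TRUE
        """,
    )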

Scheduling thousands of tasks with Airflow

We are considering using Airflow for a project that needs to make thousands of calls a day to external APIs in order to download external data, where each call might take many minutes.
One option we are considering is to create a task for each distinct API call; however, this will lead to thousands of tasks. Rendering all those tasks in the UI is going to be challenging, and we are also worried about the scheduler, which may struggle with so many tasks.
The other option is to have just a few parallel long-running tasks and implement our own scheduling within those tasks: we can add custom code to a PythonOperator that queries the database and decides which API to call next.
Perhaps Airflow is not well suited for such a use case, and it would be easier and better to implement such a system outside of Airflow? Does anyone have experience running thousands of tasks in Airflow who can shed some light on the pros and cons of the above options?
One task per call would kill Airflow, as it still needs to check on the status of each task at every heartbeat - even if the processing of the task (the worker) is separate, e.g. on K8s.
Not sure where you plan on running Airflow but if on GCP and a download is not longer than 9 min, you could use the following:
task (PythonOperator) -> pubsub -> cloud function (to retrieve) -> pubsub -> function (to save result to backend).
The latter function may not be required but we (re)use a generic and simple "bigquery streamer".
Finally, in a downstream Airflow task (a PythonSensor), you query the number of results in the backend and compare it with the number of requests published (see the sketch below).
We do this quite efficiently for 100K API calls to a third-party system we host on GCP, as we maximize parallelism. The nice thing about GCF is that you can tweak the concurrency the architecture uses, instead of provisioning a VM or container to run the tasks.
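A minimal sketch of that downstream sensor step (my own illustration, not the answer's code): keep poking until the number of rows landed in the backend matches the number of requests published. The BigQuery table, the count query, and the way the expected count is passed in are hypothetical.

from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.hooks.bigquery import BigQueryHook
from airflow.sensors.python import PythonSensor


def _all_results_arrived(expected, run_date):
    hook = BigQueryHook(gcp_conn_id="google_cloud_default", use_legacy_sql=False)
    rows = hook.get_records(
        "SELECT COUNT(*) FROM `my_project.my_dataset.api_results` "
        f"WHERE load_date = DATE '{run_date}'")
    return rows[0][0] >= expected


with DAG("api_download", start_date=datetime(2021, 1, 1),
         schedule_interval="@daily", catchup=False) as dag:
    wait_for_results = PythonSensor(
        task_id="wait_for_results",
        python_callable=_all_results_arrived,
        op_kwargs={"expected": 100_000, "run_date": "{{ ds }}"},
        poke_interval=60,
        mode="reschedule",   # frees the worker slot between pokes
    )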

Airflow SubDagOperator deadlock

I'm running into a problem where a DAG composed of several SubDagOperators hangs indefinitely.
The setup:
Using CeleryExecutor. For the purposes of this example, let's say we have one worker which can run five tasks concurrently.
The DAG I'm running into problems with runs several SubDagOperators in parallel. For illustration purposes, consider a graph in which many SubDagOperator nodes run in parallel, i.e. every node in the graph is a SubDagOperator.
The problem: the DAG stops making progress in the high-parallelism part of the graph. The root cause seems to be that the top-level SubDagOperators take up all five of the slots available for running tasks, so none of the subtasks inside those SubDagOperators are able to run. Those subtasks get stuck in the queued state, and nothing makes progress.
It was a bit surprising to me that the SubDagOperator tasks would compete with their own subtasks for task running slots, but it makes sense to me now. Are there best practices around writing SubDagOperators that I am violating?
My plan is to work around this problem by creating a custom operator to encapsulate the tasks that are currently encapsulated inside the SubDagOperators. I was wondering if anyone had advice on whether it was advisable to create an operator composed of other operators?
It does seem like SubDagOperator should be avoided because it causes this deadlock issue. I ultimately found that, for my use case, I was best served by writing my own custom BaseOperator subclass to do the work I had been doing inside the SubDagOperators (a minimal sketch below). Writing the operator class was much easier than I expected.
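For illustration, a minimal sketch of that approach (my own, not the answerer's code): a single operator whose execute() runs the steps that used to be separate tasks inside the sub-DAG, so only one worker slot is consumed per branch. The step methods are hypothetical placeholders.

from airflow.models.baseoperator import BaseOperator


class ProcessPartitionOperator(BaseOperator):
    """Runs what used to be a whole sub-DAG as a single task."""

    def __init__(self, partition, **kwargs):
        super().__init__(**kwargs)
        self.partition = partition

    def execute(self, context):
        # The steps that previously were individual tasks in the sub-DAG now
        # run sequentially inside a single task instance.
        data = self._extract(self.partition)
        transformed = self._transform(data)
        self._load(transformed)

    def _extract(self, partition):
        ...

    def _transform(self, data):
        ...

    def _load(self, data):
        ...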
