Generic Airflow data staging operator

I am trying to understand how to manage large data with Airflow. The documentation is clear that we need to use external storage rather than XCom, but I cannot find any clean examples of staging data to and from a worker node.
My expectation would be that there should be an operator that can run a staging-in operation, run the main operation, and then a staging-out operation.
Is there such an Operator or pattern? The closest I've found is the S3 File Transform operator, but it runs an executable to do the transform, rather than a generic Operator such as the DockerOperator we'd want to use.
Other "solutions" I've seen rely on everything running on a single host and using known paths, which isn't a production-ready solution.
Is there an operator that would support data staging, or is there a concrete example of handling large data with Airflow that doesn't rely on each operator being equipped with cloud copying capabilities?

Yes and no. Traditionally, Airflow is mostly an orchestrator - it does not usually "do" the work itself; it tells other systems what to do. You very rarely need to bring the actual data to an Airflow worker; the worker is mostly there to tell other systems where the data is coming from, what to do with it, and where to send it.
There are exceptions (some transfer operators actually download data from one service and upload it to another), so data does sometimes pass through an Airflow node, but this is an exception rather than the rule (the more efficient and better pattern is to invoke an external service to do the transfer and have a sensor wait until it completes).
This is the "historical", and to some extent "current", way Airflow operates. However, with Airflow 2 and beyond we are expanding this, and a pattern similar to what you describe is becoming more and more feasible - this is where XCom plays a big role.
You can - as of recently - develop custom XCom backends that allow for more than metadata sharing; they are also good for sharing the data itself. You can see the docs here: https://airflow.apache.org/docs/apache-airflow/stable/concepts/xcoms.html#custom-backends. There is also a nice article from Astronomer about it (https://www.astronomer.io/guides/custom-xcom-backends) and a very nice talk from Airflow Summit 2021 (from last week) presenting it: https://airflowsummit.org/sessions/2021/customizing-xcom-to-enhance-data-sharing-between-tasks/. I highly recommend watching the talk!
Mapping this onto your pattern: the XCom pull is the staging-in, the Operator's execute() is the operation, and the XCom push is the staging-out.
This pattern will, I think, be reinforced by upcoming Airflow releases and some cool integrations that are coming. There will likely be more cool data-sharing options in the future (but I think they will all be based on the - maybe slightly enhanced - XCom implementation).
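A custom XCom backend boils down to overriding how values are serialized and deserialized. Below is a self-contained sketch of that interface using local disk as the "external storage"; a real backend would subclass airflow.models.xcom.BaseXCom and push to S3/GCS via a provider hook. The class name, directory, and prefix here are illustrative assumptions.

```python
import json
import tempfile
import uuid
from pathlib import Path

# Illustrative stand-in for external storage (a real backend would use S3/GCS).
STAGE_DIR = Path(tempfile.gettempdir()) / "xcom-stage"

class FileXComBackend:
    """Sketch of the custom XCom backend interface: large values are written
    to external storage, and only a small reference string is stored in the
    Airflow metadata database."""

    PREFIX = "xcom-file://"

    @staticmethod
    def serialize_value(value):
        """Staging-out: write the (potentially large) value externally and
        return a small reference that goes into the metadata DB."""
        STAGE_DIR.mkdir(parents=True, exist_ok=True)
        key = STAGE_DIR / f"{uuid.uuid4()}.json"
        key.write_text(json.dumps(value))
        return f"{FileXComBackend.PREFIX}{key}"

    @staticmethod
    def deserialize_value(reference):
        """Staging-in: resolve the reference back into the real value."""
        assert reference.startswith(FileXComBackend.PREFIX)
        path = Path(reference[len(FileXComBackend.PREFIX):])
        return json.loads(path.read_text())
```

Pointing Airflow 2 at such a class (via the core `xcom_backend` configuration option) makes every XCom push/pull transparently stage data out to and in from external storage.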

Related

Best way to send periodic spreadsheet-style reports from a Redshift DWH?

Ello M8's,
Currently in charge of an Airflow DAG that sends periodic reports directly to people's emails as custom CSVs, and sometimes to external S3.
The DAG implementation I'm working with is great for what it already does, but it's difficult to extend and scale. In the case of a refactor, I'm wondering what the proper tool/method is to accomplish automated reports from a Redshift DWH. Any tips? Kind of wish AWS had a reporting service on top of Redshift, but maybe Airflow is their answer to that in the first place.

airflow create sub process based on number of files

A newbie question in airflow,
I have a list of 100+ servers in a text file. Currently, a Python script is used to log in to each server, read a file, and write the output. It takes a long time to get the output. If this job is converted to an Airflow DAG, is it possible to split the servers into multiple groups, with a new task initiated for each group using some operator? Or can this only be achieved by modifying the Python script (e.g. using async) and executing it with the PythonOperator? Seeking advice/best practices. I tried searching for examples but was not able to find one. Thanks!
Airflow is not really a "map-reduce" type of framework (which is what you seem to be trying to implement). Airflow tasks are not (at least currently) designed to split work between themselves. It is very atypical for Airflow to have N tasks that each do the same thing on a different subset of the data. Airflow is more for orchestrating the logic: each task conceptually does a different thing, and there are rarely cases where N parallel tasks do the same thing on different subsets of data. More often than not, Airflow tasks do not "do" the job themselves; they tell other systems what to do and wait until it gets done.
Typically, Airflow is used to orchestrate services that excel at this kind of processing - for example, a Hadoop job could handle such "parallel", map-reduce-style workloads. You could also - as you mentioned - use async, multi-threading, or even multi-processing inside a PythonOperator, but at some scale other dedicated tools will typically be easier to use and better at extracting the most value (for example, through efficient utilization of parallelism).
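If you do go the single-PythonOperator route, the fan-out can be sketched with a thread pool. This is illustrative only: fetch_output and the worker count are placeholder assumptions standing in for the real "log in and read a file" logic, and collect_all would be wired in as the PythonOperator's python_callable.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def fetch_output(server):
    # Placeholder: in the real script this would SSH in and read the file.
    return f"{server}: ok"

def collect_all(servers, max_workers=20):
    """Fan the I/O-bound per-server work out over a thread pool and collect
    the results, continuing even if individual servers fail."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fetch_output, s): s for s in servers}
        for future in as_completed(futures):
            server = futures[future]
            try:
                results[server] = future.result()
            except Exception as exc:  # record the failure, keep going
                results[server] = f"error: {exc}"
    return results
```

Threads are appropriate here because the per-server work is network-bound; for CPU-bound work a process pool would be the analogous sketch.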

Snowpipe vs Airflow for Continuous data loading into Snowflake

I had a question related to Snowflake. Actually in my current role, I am planning to migrate data from ADLS (Azure data lake) to Snowflake.
I am currently looking at 2 options:
Creating a Snowpipe to load the updated data
Creating an Airflow job for the same
I am still trying to understand which will be the best approach and what the pros and cons of each are.
It depends on what you are trying to do as part of this migration. If it is a plain vanilla (no transformations, no complex validations), as-is migration of data from ADLS to Snowflake, then you may be fine with Snowpipe (but please also check whether your scenario is better suited to Snowpipe or bulk COPY: https://docs.snowflake.com/en/user-guide/data-load-snowpipe-intro.html#recommended-load-file-size).
If you have many steps before you move the data to Snowflake, and there is a chance you may need to change your workflow in the future, it is better to use Airflow, which will give you more flexibility. In one of my migrations I used Airflow, and in the other one CONTROL-M.
You'll be able to load higher volumes of data with lower latency if you use Snowpipe instead of Airflow. It'll also be easier to manage Snowpipe in my opinion.
Airflow is a batch scheduler, and using it to schedule anything that runs more frequently than every 5 minutes becomes painful to manage. Also, you'll have to manage the scaling yourself with Airflow. Snowpipe is a serverless option that scales up and down based on the volumes it sees, and you're going to see your data land within about 2 minutes.
The only thing that should restrict your usage of Snowpipe is cost, although you may find that Snowpipe ends up being cheaper in the long run once you consider that you'll also need someone to manage your Airflow pipelines.
There are a few considerations. Snowpipe can only run a single COPY command, which has some limitations itself, and Snowpipe imposes further limitations (see its Usage Notes). The main pain point is that it does not support PURGE = TRUE | FALSE (i.e. automatic purging after loading), saying:
Note that you can manually remove files from an internal (i.e.
Snowflake) stage (after they’ve been loaded) using the REMOVE command.
Regrettably, the Snowflake docs are famously vague, written in an ambiguous, colloquial style. While they say you "can" remove the files manually, in reality any user using Snowpipe as advertised for "continuous fast ingestion" must remove the files, or suffer the performance/cost impact of the COPY command having to ignore a very large number of previously loaded files. The docs around the cost and performance of "table directories" (which are implicit to stages) talk about 1M files being a lot of files. By way of an official example, the default pipe flush time of the Snowflake Kafka connector's Snowpipe is 120s, so assuming data ingests continually and you create one file per flush, you will hit 1M files in about 2 years. Yet using Snowpipe is supposed to imply low latency: if you were to lower the flush to 30s, you may hit the 1M-file mark in about half a year.
If you want a fully automated process with no manual intervention, this means that after you have pushed files into a stage and invoked the pipe, you need logic to poll the API to learn which files were eventually loaded; your logic can then remove the loaded files. The official Snowpipe Java example code has logic that pushes files and then polls the API to check when the files are eventually loaded. The Snowflake Kafka connector also polls to check which files the pipe has eventually, asynchronously completed. Alternatively, you might write an Airflow job to LIST @the_stage, look for files whose last_modified is further in the past than some safe threshold, and then REMOVE @the_stage/path/file.gz for the older files.
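A minimal sketch of the threshold logic for such a cleanup job, in pure Python: the listing mirrors the (name, last_modified) columns of a stage LIST, and each returned name would then be fed into a REMOVE statement via the Snowflake connector. The 48-hour threshold is an illustrative assumption, not a Snowflake recommendation.

```python
from datetime import datetime, timedelta, timezone

# Illustrative threshold: how far in the past last_modified must be before a
# staged file is considered safely loaded and removable.
SAFE_AGE = timedelta(hours=48)

def files_to_purge(listing, now=None):
    """Given (name, last_modified) pairs from a stage listing, return the
    names of files old enough to remove."""
    now = now or datetime.now(timezone.utc)
    return [name for name, last_modified in listing
            if now - last_modified > SAFE_AGE]
```

Whatever threshold you pick needs to comfortably exceed the pipe's worst-case load latency, since removing a not-yet-loaded file loses data.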
The next limitation is that the COPY command is a "COPY INTO your_table" command that can only target a single table. You can, however, do advanced transformations using SQL within the COPY command.
Another thing to consider is that neither latency nor throughput is guaranteed with Snowpipe. The documentation very clearly says you should measure the latency yourself. It would be a complete "free lunch" if Snowpipe, which runs on shared infrastructure to reduce your costs, ran instantly and just as fast as if you were paying for hot warehouses. It is reasonable to assume a higher tail latency when using shared "on-demand" infrastructure (i.e. a low percentage of invocations with a high delay).
You have no control over the size of the warehouse used by Snowpipe. This will affect the performance of any SQL transforms used in the COPY command. In contrast, if you run from Airflow you assign a warehouse to run the COPY command, and you can assign as big a warehouse as you need for your transforms.
A final consideration is that to use Snowpipe you need to make a Snowflake API call. That is significantly more complex code to write than making a regular database connection to load data into a stage. For example, the regular Snowflake JDBC connection has methods to stream data efficiently into stages without having to write OAuth code to call the Snowflake API.
To be very clear: if you read the Snowpipe documentation carefully, you will see that Snowpipe is simply a restricted COPY INTO table command running on shared infrastructure and executed eventually, at some point, whereas you can run a full COPY command yourself, as part of a more complex SQL script, on a warehouse that you can size and suspend. If you can live with the restrictions of Snowpipe, can figure out how to remove the files in the stage yourself, and can accept that tail latency and throughput are likely to be worse than with a dedicated warehouse, then it could be a good fit.

Scheduling thousands of tasks with Airflow

We are considering using Airflow for a project that needs to make thousands of calls a day to external APIs in order to download external data, where each call might take many minutes.
One option we are considering is to create a task for each distinct API call; however, this will lead to thousands of tasks. Rendering all those tasks in the UI is going to be challenging, and we are also worried about the scheduler, which may struggle with so many tasks.
The other option is to have just a few parallel long-running tasks and implement our own scheduling within them. We could add custom code to a PythonOperator that queries the database and decides which API to call next.
Perhaps Airflow is not well suited to such a use case, and it would be easier and better to implement such a system outside of Airflow? Does anyone have experience with running thousands of tasks in Airflow who can shed some light on the pros and cons of the above use case?
One task per call would kill Airflow, as it still needs to check the status of each task at every heartbeat - even if the processing of the task (the worker) is separate, e.g. on K8s.
Not sure where you plan on running Airflow but if on GCP and a download is not longer than 9 min, you could use the following:
task (PythonOperator) -> pubsub -> cloud function (to retrieve) -> pubsub -> function (to save result to backend).
The latter function may not be required, but we (re)use a generic and simple "BigQuery streamer".
Finally, you query in a downstream AF task (PythonSensor) the number of results in the backend and compare with the number of requests published.
We do this quite efficiently for 100K API calls to a third-party system we host on GCP, as we maximize parallelism. The nice thing about GCF is that you can tweak the architecture and concurrency you use, instead of provisioning a VM or container to run the tasks.
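The final sensor step above could be sketched like this: count_results and count_requests are hypothetical stand-ins for the real BigQuery/Pub/Sub lookups, and the PythonSensor wiring is shown as a comment rather than executed.

```python
def make_poke(count_results, count_requests):
    """Build a poke callable for a PythonSensor: returns True once the number
    of results landed in the backend matches the number of requests
    published."""
    def poke(**context):
        done, expected = count_results(), count_requests()
        print(f"{done}/{expected} results landed")
        return done >= expected
    return poke

# In the DAG this would be wired up roughly as:
# from airflow.sensors.python import PythonSensor
# PythonSensor(task_id="wait_for_results",
#              python_callable=make_poke(count_results, count_requests),
#              poke_interval=60, timeout=3600)
```

The sensor re-runs poke every poke_interval seconds, so the two count functions should be cheap idempotent queries.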

Event Driven Architecture - Service Contract Design

I'm having difficulty conceptualising a requirement I have into something that will fit into our nascent SOA/EDA.
We have a component I'll call the Data Downloader. This is a facade for an external data provider that has both high latency and a cost associated with every request. I want to take this component and create a reusable service out of it with a clear contract definition. It is up to me to decide how that contract should work; however, its responsibilities are two-fold:
Maintain the parameter list (called a Download Definition) for an upcoming scheduled download
Manage the technical details of the communication to the external service
Basically, it manages the 'how' of the communication. The 'what' and the 'when' are the responsibilities of two other components:
The 'what' is managed by 'Clients', who are responsible for determining the parameters for the download.
The 'when' is managed by a dedicated scheduling component. Because of the cost associated with the downloads we'd like to batch the requests intraday.
Hopefully this sequence diagram explains the responsibilities of the services:
Because the responsibilities are split across three different components, we get all sorts of potential race conditions with async messaging. For instance, when the Scheduler tells the Downloader to do its work, because the 'Append to Download Definition' command is asynchronous, there is no guarantee that the pending requests from Client A have actually been serviced. But this all screams high coupling to me; why should the Scheduler need to know about any 'prerequisite' client requests that must have been actioned before it can invoke a download?
Some potential solutions we've toyed with:
Make the 'Append to Download Definition' command a blocking request/response operation. But this then breaks the performance and scalability benefits of having an EDA.
Build something into the Downloader to ensure that it only runs when there are no pending commands in its incoming request queue. But that introduces a dependency on the underlying messaging infrastructure, which I don't like either.
Makes me think I'm approaching this problem completely backwards. Or is this just a classic case of someone trying to fit a synchronous RPC requirement into an async event-driven architecture?
The thing I like most about EDA and SOA is that it almost completely eliminates the notion of a race condition. As long as your events are associated with some correlation key (e.g. downloadId), the problem you describe can be addressed by several solutions of different complexities, depending on your needs. I'm not sure I totally understand the described use case, but I will try my best.
Off the top of my head:
The DataDownloader maintains a list of received Download Definitions and a list of triggered downloads. When a definition is received, it is checked against the triggers list to see whether the associated download has already been triggered; if it has, the download is executed. When a TriggerDownloadCommand is received, the definitions list is checked for a definition with the associated downloadId.
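A minimal in-memory sketch of this reconciliation, assuming one definition per downloadId (the names are illustrative, and the dicts would need to be a durable store in production):

```python
class DataDownloader:
    """Correlates DownloadDefinitions and TriggerDownloadCommands by
    downloadId and starts the download only once both have arrived, so the
    order in which the two messages land no longer matters."""

    def __init__(self, execute_download):
        self.definitions = {}   # downloadId -> definition
        self.triggered = set()  # downloadIds whose trigger has arrived
        self.execute_download = execute_download

    def on_definition(self, download_id, definition):
        self.definitions[download_id] = definition
        if download_id in self.triggered:  # trigger arrived first
            self.execute_download(download_id, definition)

    def on_trigger(self, download_id):
        self.triggered.add(download_id)
        if download_id in self.definitions:  # definition arrived first
            self.execute_download(download_id, self.definitions[download_id])
```

Either message arriving first is handled symmetrically, which is exactly what removes the race.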
For more complex situations, consider using the Saga pattern, which is implemented by some third-party messaging infrastructures. With some simple configuration, it will handle both messages and initiate the actual download once the required condition is satisfied. This is more appropriate for distributed systems, where an in-memory collection is out of the question.
You can also configure your Scheduler (or the trigger command handler) to retry when an error is signalled (e.g. by an exception) in order to avoid that race condition, and ultimately give up after a specified timeout.
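The retry-until-timeout idea can be sketched as follows (the timeout and backoff values are illustrative):

```python
import time

def retry_until(action, timeout=5.0, backoff=0.1):
    """Retry action() while the race may still resolve; re-raise its last
    error once the deadline would be exceeded."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            return action()
        except Exception:
            if time.monotonic() + backoff > deadline:
                raise
            time.sleep(backoff)
```

A trigger handler wrapped this way simply fails fast on each attempt and lets the retry loop absorb the window where the definition has not yet arrived.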
Does this help?
