Airflow versus AWS Step Functions for workflow

I am working on a project that grabs a set of input data from AWS S3, pre-processes and divvies it up, spins up 10K batch containers to process the divvied data in parallel on AWS Batch, post-aggregates the data, and pushes it to S3.
I already have software patterns from other projects for Airflow + Batch, but have not dealt with the scaling factor of 10k parallel tasks. Airflow is nice since I can look at which tasks failed and retry a task after debugging. But dealing with that many tasks on one Airflow EC2 instance seems like a barrier. The other option would be to have one task that kicks off the 10k containers and monitors them from there (roughly the pattern sketched below).
I have no experience with Step Functions, but have heard it's AWS's Airflow. There seem to be plenty of patterns online for Step Functions + Batch. Does Step Functions seem like a good path to check out for my use case? Do you get the same insight into failing jobs / ability to retry tasks as you do with Airflow?
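For reference, the "one task kicks off the 10k containers and monitors them" option could look roughly like the sketch below: a single task submits an AWS Batch array job via boto3 and polls it. The queue and job definition names are placeholders, not from the post.

    # Sketch only: one task submits a 10,000-wide AWS Batch array job and polls it.
    # Queue/definition names are placeholders; each child container finds its shard
    # via the AWS_BATCH_JOB_ARRAY_INDEX environment variable.
    import time

    import boto3

    batch = boto3.client("batch")

    def submit_and_wait():
        response = batch.submit_job(
            jobName="process-shards",
            jobQueue="my-job-queue",            # placeholder
            jobDefinition="my-job-definition",  # placeholder
            arrayProperties={"size": 10000},    # fans out into 10,000 child jobs
        )
        parent_id = response["jobId"]
        while True:
            status = batch.describe_jobs(jobs=[parent_id])["jobs"][0]["status"]
            if status in ("SUCCEEDED", "FAILED"):
                return status
            time.sleep(60)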

I have worked on both Apache Airflow and AWS Step Functions and here are some insights:
Step Functions provides out-of-the-box maintenance. It has the high availability and scalability required for your use case; with Airflow you would have to build that yourself with auto-scaling/load balancing on servers or containers (Kubernetes).*
Both Airflow and Step Functions have user-friendly UIs. Airflow supports multiple representations of the workflow, while Step Functions only displays the state machine as a DAG.
As of version 2.0, Airflow's REST API is now stable. AWS Step Functions is also supported by a range of production-grade CLIs and SDKs.
Airflow has server costs, while Step Functions has a free tier of 4,000 state transitions per month and costs $0.000025 per transition after that. For example, if you run 10K steps for AWS Batch once daily, you will pay about $0.25 per day (roughly $7.50 per month). The price of an Airflow server (t2.large EC2, 1-year reserved instance) is $41.98 per month. You will pay for AWS Batch in either case.**
AWS Batch integrates with both Airflow and Step Functions.
You can clear and rerun a failed task in Apache Airflow, but in Step Functions you would have to build a custom implementation to handle that. You can also define automated retries with back-offs in the Step Functions state machine definition (an Airflow-side example of retry configuration follows this list).
For a failed task in Step Functions you get a visual representation of the failed state and a detailed message when you click on it. You can also use the AWS CLI or an SDK to get the details.
Step Functions uses easy-to-read JSON (the Amazon States Language) as the state machine definition, while Airflow uses Python scripts.
Step Functions supports async callbacks, i.e. the state machine pauses until an external source notifies it to resume; Airflow has yet to add this feature.
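To illustrate the Airflow side of the points above, here is a minimal sketch of a pipeline defined as a Python script with per-task retries and exponential back-off. The DAG and task names are illustrative only, and the import path assumes Airflow 2.x.

    # Minimal Airflow 2.x sketch: the workflow is plain Python, and retries with
    # exponential back-off are configured through default_args. Names are illustrative.
    from datetime import datetime, timedelta

    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def process_shard():
        print("processing...")

    with DAG(
        dag_id="batch_processing_example",
        start_date=datetime(2021, 1, 1),
        schedule_interval="@daily",
        catchup=False,
        default_args={
            "retries": 3,                       # automatic retries on failure
            "retry_delay": timedelta(minutes=5),
            "retry_exponential_backoff": True,  # back off between attempts
        },
    ) as dag:
        PythonOperator(task_id="process_shard", python_callable=process_shard)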
Overall, I see more advantages in using AWS Step Functions. You will have to weigh the maintenance cost and development cost of both services against your use case.
UPDATES (AWS Managed Workflows for Apache Airflow Service):
*With the AWS Managed Workflows for Apache Airflow (MWAA) service, you can offload deployment, maintenance, autoscaling/load balancing and security of your Airflow service to AWS. But please consider the version number you're willing to settle for, as AWS managed services are usually behind the latest release (e.g. as of March 08, 2021, the latest version of open-source Airflow is 2.0.1, while MWAA offers 1.10.12).
**MWAA pricing is based on the environment, instance size and storage used.

I have used both Airflow and Step Functions in my personal and work projects.
In general I liked Step Functions, but the fact that you need to schedule executions with EventBridge is super annoying. Actually, I think Airflow could just act as the trigger for the Step Functions state machine here.
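As a rough sketch of that idea (assuming boto3 and a placeholder state machine ARN), an Airflow task could start the Step Functions execution directly instead of relying on an EventBridge schedule:

    # Sketch: a callable for a PythonOperator that starts a Step Functions execution.
    # The state machine ARN is a placeholder.
    import json

    import boto3

    def trigger_state_machine(**context):
        sfn = boto3.client("stepfunctions")
        response = sfn.start_execution(
            stateMachineArn="arn:aws:states:us-east-1:123456789012:stateMachine:my-pipeline",
            input=json.dumps({"run_date": context["ds"]}),  # pass Airflow's logical date along
        )
        return response["executionArn"]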
If Airflow were cheaper to manage, I would always opt for it, because I find managing JSON-based pipelines a hassle whenever I need to detour from the main use case, which always happens to me somehow. This becomes an even more complex issue when you need source control.
This one is a more subjective assessment, but I find the monitoring capability of Airflow far greater than that of Step Functions.

AWS now offers managed Airflow (MWAA), which is priced per hour and doesn't require a dedicated EC2 instance. On the other hand, Step Functions workflows typically rely on AWS Lambda functions, which have an execution time limit of 15 minutes, making them not the best candidate for long-running pipelines.

Related

How can I reduce costs of ECS Fargate being used to run an R ShinyApp

I am running an R ShinyApp on Fargate ECS. It is roughly used once per week by the customer. It is running constantly and therefore we are paying for a substantial amount of idle time.
Is there a way it could be launched when there is an incoming connection and then stopped when this connection ends?
Does anyone have any suggestions for this?
Many thanks
You need a serverless style of application hosting, e.g. as suggested by a commenter, API Gateway backed by Lambda. If your request count is low, you may actually not pay much thanks to the free tier for these services. There is an R runtime for Lambda here:
[1] Serverless execution of R code on AWS Lambda - https://github.com/bakdata/aws-lambda-r-runtime

Differences between matillion and apache airflow

I want to use an ETL service, but I am stuck between Apache Airflow and Matillion.
Are they the same?
What are the main differences?
Airflow's primary use case is orchestration/scheduling, not ETL. You can perform ETL tasks inside Airflow DAGs, but unless you're planning on implementing Airflow using a containerized/Kubernetes architecture, you'll quickly hit performance bottlenecks and even hung/stuck processes. There are ways to mitigate this, certainly, but it's not the primary use case.
Matillion's primary use-case is ETL (really ELT), so it's not going to suffer the same performance issues, or require a complex infrastructure to achieve that performance. It also provides a GUI based code-optional interface, so that you don't have to be a Python expert to achieve results quickly.
I actually view Airflow and Matillion as (potentially) complementary. If you have inter-application dependencies, for example, you can orchestrate Matillion workflows with Airflow, or another third-party scheduler, and gain the benefits of both.
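As a hedged sketch of that orchestration idea, an Airflow task could launch a Matillion job over Matillion's REST API. The endpoint layout, group/project/job names and credentials below are assumptions and should be checked against your Matillion version's documentation.

    # Assumed sketch only: trigger a Matillion job from an Airflow PythonOperator.
    # The URL path, group/project/job names and credentials are placeholders.
    import requests

    def run_matillion_job():
        url = (
            "https://matillion.example.com/rest/v1"
            "/group/name/MyGroup/project/name/MyProject"
            "/version/name/default/job/name/my_elt_job/run"  # assumed endpoint layout
        )
        response = requests.post(url, auth=("api-user", "api-password"), timeout=30)
        response.raise_for_status()
        return response.json()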
I've never used Matillion. So I can't answer with respect to any specific use case you have.
But from a quick analysis of Matillion I can tell that Matillion and Airflow aren't the same at all.
Matillion is an Extract/Transform/Load (ETL) tool. You can compare it with tools like AWS Glue, Apache NiFi or DMExpress.
Airflow is an orchestration tool. You can compare it with tools like Oozie.
More importantly, Matillion doesn't come free of cost.

Sharing large intermediate state between Airflow tasks

We have an Airflow deployment with Celery executors.
Many of our DAGs require a local processing step of some file in a BashOperator or PythonOperator.
However, in our understanding the tasks of a given DAG may not always be scheduled on the same machine.
The options for state sharing between tasks I've gathered so far:
Use Local Executors - this may suffice for one team, depending on the load, but may not scale to the wider company
Use XCom - does this have a size limit? Probably unsuitable for large files
Write custom Operators for every combination of tasks that need local processing in between. This approach reduces modularity of tasks and requires replicating existing operators' code.
Use Celery queues to route DAGs to the same worker (docs) - This option seems attractive at first, but what would be an appropriate way to set it up in order to avoid routing everything to one executor, or crafting a million queues?
Use a shared network storage in all machines that run executors - Seems like an additional infrastructure burden, but is a possibility.
What is the recommended way to do sharing of large intermediate state, such as files, between tasks in Airflow?
To clarify something: no matter how you set up Airflow, there will only be one executor running.
The executor runs on the same machine as the scheduler.
Currently (Airflow 1.9.0 at the time of writing) there is no safe way to run multiple schedulers, so there will only ever be one executor running.
The Local executor executes the task on the same machine as the scheduler.
The Celery executor just puts tasks in a queue to be worked on by the Celery workers.
However, the question you are asking does apply to Celery workers. If you use the Celery executor you will probably have multiple Celery workers.
Using network shared storage solves multiple problems:
Each worker machine sees the same dags because they have the same dags folder
Results of operators can be stored on a shared file system
The scheduler and webserver can also share the dags folder and run on different machines
I would use network storage and write the output file name to XCom. Then, when you need the output from a previous task, you would read the file name from that task's XCom and process that file (sketched below).
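A minimal sketch of that pattern, assuming a shared mount at a placeholder path and the Airflow 1.x-era PythonOperator with provide_context:

    # Sketch: the large file lives on shared network storage; only its path goes
    # through XCom. The mount point and task names are illustrative.
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python_operator import PythonOperator  # 1.x-era import path

    SHARED_DIR = "/mnt/airflow-shared"  # assumed network mount visible to every worker

    def produce(**context):
        path = "{}/{}_intermediate.csv".format(SHARED_DIR, context["ds"])
        with open(path, "w") as f:
            f.write("large intermediate result...")
        context["ti"].xcom_push(key="result_path", value=path)  # path only, not the data

    def consume(**context):
        path = context["ti"].xcom_pull(task_ids="produce", key="result_path")
        with open(path) as f:
            data = f.read()  # hand the contents to the real processing step here

    with DAG("shared_state_example", start_date=datetime(2018, 1, 1), schedule_interval="@daily") as dag:
        t1 = PythonOperator(task_id="produce", python_callable=produce, provide_context=True)
        t2 = PythonOperator(task_id="consume", python_callable=consume, provide_context=True)
        t1 >> t2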
Alternatively, change the datatype of the value column in the xcom table of the Airflow metastore.
Its default datatype is BLOB; change it to LONGBLOB.
That will let you store up to 4 GB of intermediate data between tasks.

AWS Batch executor with Airflow

I'm currently using Airflow on Amazon Web Services with EC2 instances. The big issue is that the average usage of the instances is about 2%...
I'd like to use a scalable architecture, creating instances only for the duration of a job and then killing them. I saw on the roadmap that AWS Batch was supposed to become an executor in 2017, but there's been no news about that.
Do you know if it is possible to use AWS Batch as an executor for all Airflow jobs?
Regards,
Romain.
There is no executor, but an operator is available from version 1.10. After you create a Compute Environment, Job Queue and Job Definition in AWS Batch, you can use the AWSBatchOperator to trigger jobs (usage sketched below).
Here is the source code.
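A small usage sketch, assuming the Airflow 1.10 contrib import path (it later moved to the Amazon provider package) and placeholder queue/definition names:

    # Sketch: trigger an AWS Batch job from a DAG with the 1.10 contrib operator.
    # Queue, definition and job names are placeholders created beforehand in AWS Batch.
    from datetime import datetime

    from airflow import DAG
    from airflow.contrib.operators.awsbatch_operator import AWSBatchOperator

    with DAG("batch_jobs", start_date=datetime(2019, 1, 1), schedule_interval="@daily") as dag:
        submit = AWSBatchOperator(
            task_id="submit_batch_job",
            job_name="example-job",
            job_queue="my-job-queue",
            job_definition="my-job-definition",
            overrides={},              # containerOverrides passed through to submit_job
            region_name="us-east-1",
        )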
Currently there are a SequentialExecutor, a LocalExecutor, a DaskExecutor, a CeleryExecutor and a MesosExecutor. I heard they're working on AIRFLOW-1899, targeted for 2.0, to introduce a KubernetesExecutor. Looking at Dask and Celery, it doesn't seem they support a mode where their workers are created per task. Mesos might, Kubernetes should, but then you'd have to scale the worker clusters accordingly to account for turning off the nodes when they're not needed.
We did a little work to get a CloudFormation setup where Celery workers scale out and in based on CloudWatch metrics of the average CPU load across the tagged workers.
You would need to create a custom executor (extending BaseExecutor) capable of submitting and monitoring the AWS Batch jobs. You may also need to create a custom Docker image for the instances (a rough skeleton follows).
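A very rough skeleton of that idea, assuming placeholder queue and job definition names; a real executor would also need log handling, configuration and error handling:

    # Rough skeleton only: a BaseExecutor subclass that submits each task's command
    # as an AWS Batch job and polls for completion in sync(). Not production ready.
    import boto3

    from airflow.executors.base_executor import BaseExecutor
    from airflow.utils.state import State

    class AwsBatchExecutor(BaseExecutor):
        def start(self):
            self.batch = boto3.client("batch")
            self.jobs = {}  # Airflow task key -> Batch job id

        def execute_async(self, key, command, queue=None, executor_config=None):
            response = self.batch.submit_job(
                jobName="airflow-task",
                jobQueue="airflow-queue",          # placeholder
                jobDefinition="airflow-worker",    # placeholder image with Airflow installed
                containerOverrides={"command": command},  # list of strings in Airflow 2.x
            )
            self.jobs[key] = response["jobId"]

        def sync(self):
            for key, job_id in list(self.jobs.items()):
                status = self.batch.describe_jobs(jobs=[job_id])["jobs"][0]["status"]
                if status == "SUCCEEDED":
                    self.change_state(key, State.SUCCESS)
                    del self.jobs[key]
                elif status == "FAILED":
                    self.change_state(key, State.FAILED)
                    del self.jobs[key]

        def end(self):
            self.sync()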
I found this repository, which in my case is working quite well: https://github.com/aelzeiny/airflow-aws-executors. I'm using Batch jobs with a FARGATE_SPOT compute environment.
I'm still struggling with logging to AWS CloudWatch and with the return status in AWS Batch, but from the Airflow perspective it's working.

Using airflow for real time job orchestration

I have an application that runs as a web service and submits jobs to Spark on user request. The job queue needs to be limited per user. I am planning to use Airflow as an orchestration framework to manage the job queues, but while it supports parallel DAG execution, it is optimized for batch processing rather than real time. Is Airflow designed to handle ~200 DAG executions per second with multiple queues (one per user), or should I look for alternatives?
Do you have data moving from one task to another? Does time matter here, since you mentioned real time? With Airflow, workflows are expected to be mostly static or slowly changing, mostly for ETL batch processing. You can speed up the Airflow heartbeat, but it would be good to build a POC with your use case to test it out.
Below is from the official Airflow documentation: https://airflow.apache.org/#beyond-the-horizon
Airflow is not a data streaming solution. Tasks do not move data from one to the other (though tasks can exchange metadata!). Airflow is not in the Spark Streaming or Storm space, it is more comparable to Oozie or Azkaban.
