I continue to read that Airflow is meant to be an orchestration engine and not an execution engine. But what does that mean?
I know that Kubernetes is also called an orchestration tool. So is the idea that Airflow is simply meant to define high-level dependency graphs but not any of the code to actually perform said tasks? I am still really struggling to understand the paradigm that Airflow is operating with.
Related
We started to implement Airflow for task scheduling about a year ago, and we are slowly migrating more and more tasks to Airflow. At some point we noticed that the server was filling up with logs, even after we implemented remote logging to S3. I'm trying to understand what the best way to handle logs is, and I've found a lot of conflicting advice, such as in this stackoverflow question from 4 years ago.
Implementing maintenance dags to clean out logs (airflow-maintenance-dags)
Implementing our own FileTaskHandler
Using the logrotate linux utility
When we implemented remote logging, we expected the local logs to be removed after they were shipped to S3, but this is not the case. Local logs remain on the server. I thought this might be a problem with our configuration but I haven't found any way to fix that. Also, remote logging only applies to task logs, but process logs (specifically scheduler logs) are always local, and they took up the most space.
We tried to implement maintenance dags, but our workers are running from a different location to the rest of airflow, particularly the scheduler, so only task logs were getting cleaned. We could get around this by creating a new worker that shares logs with the scheduler, but we prefer not to create extra workers.
We haven't tried to implement either of the other two suggestions yet. But that is why I want to understand, how are other people solving this, and what is the recommended way?
I am working on a project that grabs a set of input data from AWS S3, pre-processes and divvies it up, spins up 10K batch containers to process the divvied data in parallel on AWS Batch, post-aggregates the data, and pushes it to S3.
I already have software patterns from other projects for Airflow + Batch, but have not dealt with the scaling factors of 10k parallel tasks. Airflow is nice since I can look at which tasks failed and retry a task after debugging. But dealing with that many tasks on one Airflow EC2 instance seems like a barrier. Other option would be to have one task that kicks off the 10k containers and monitors it from there.
I have no experience with Step Functions, but have heard it's AWS's Airflow. There looks to be plenty of patterns online for Step Functions + Batch. Does Step Functions seem like a good path to check out for my use case? Do you get the same insights on failing jobs / ability to retry tasks as you do with Airflow?
I have worked on both Apache Airflow and AWS Step Functions and here are some insights:
Step Functions provide out of the box maintenance. It has high availability and scalability that is required for your use-case, for Airflow we'll have to do to it with auto-scaling/load balancing on servers or containers (kubernetes).*
Both Airflow and Step Functions have user friendly UI's. While Airflow supports multiple representations of the state machine, Step Functions only display state machine as DAG's.
As of version 2.0, Airflow's Rest API is now stable. AWS Step Functions are also supported by a range of production graded cli and SDK's.
Airflow has server costs while Step Functions have 4000/month free step executions (free tier) and $0.000025/step after that. e.g. if you use 10K steps for AWS Batch that run once daily, you will be priced $0.25 per day ($7.5 per month). The price for Airflow server (t2.large ec2 1 year reserved instance) is $41.98 per month. We will have to use AWS Batch for either case.**
AWS Batch can integrate to both Airflow and Step Functions.
You can clear and rerun a failed task in Apache Airflow, but in Step Functions you will have to create a custom implementation to handle that. You may handle automated retries with back-offs in Step Functions definition as well.
For failed task in Step Functions you will get a visual representation of failed state and the detailed message when you click it. You may also use aws cli or sdk to get the details.
Step Functions use easy to use JSON as state machine definition, while Airflow uses Python script.
Step Functions support async callbacks, i.e. state machine pauses until an external source notifies it to resume. While Airflow has yet to add this feature.
Overall, I see more advantages of using AWS Step Functions. You will have to consider maintenance cost and development cost for both services as per your use case.
UPDATES (AWS Managed Workflows for Apache Airflow Service):
*With AWS Managed Workflows for Apache Airflow service, you can offload deployment, maintenance, autoscaling/load balancing and security of your Airflow Service to AWS. But please consider the version number you're willing to settle for, as AWS managed services are mostly behind the latest version. (e.g. As of March 08, 2021, the latest version of open source airflow is 2.01, while MWAA allows version 1.10.12)
**MWAA costs on environment, instance and storage. More details here.
I have used both Airflow and Step Functions in my personal and work projects.
In general I liked step functions but the fact that you need to schedule the execution with Event Bridge is super annoying. Actually I think here Airflow could just act as a triggered for the step functions.
If Airflow would be cheaper to manage, I would always opt for it because I find managing Json based pipelines a hustle whenever you need to detour from the main use case. This always happen for me somehow.This becomes even a more complex issue when you need to have source control.
This one is a bit more subjective assessment but I find the monitoring capability of Airflow far greater than for step functions.
Also some information about the usage of Airflow vs Step functions
Aws currently has managed airflow which is priced per hour and you don’t need to have dedicated ec2. On the other hand step functions are aws lambdas that have an execution time limit of 15min which makes them not the best candidate for a long running pipelines
I am designing a CorDapp, which would require user input as well as API integration, and I am considering various approaches to expose flows and vault queries to the outside world.
Default option seems to be to use Corda RPC. Unless I missed something, there are only Java bindings for it, which is effectively restricting the clients to only be JVM-based. This is somewhat inconvenient, and ideally I would like something like OpenAPI to make it more open and implementation-agnostic.
Another option is to use some kind of Corda RPC to OpenAPI proxy. I know about Braid, and I'm sure there are others. Braid seems to support deployment as a Corda service packed together with the flows into the CorDapp itself, effectively making it running embedded into the Corda JVM.
Braid can be deployed as a standalone proxy too, which I suppose is option three.
Instinctively I find the embedded mode more attractive, as it reduces the number of moving parts, as opposed to a standalone mode. However, I am concerned that such model may be in fact become discouraged at some point, either because Corda developers consider it to be a misuse of services facility, or because some organisations will not be keen to deploy such code onto their nodes, especially when they may be running multiple CorDapps. I would imagine anything deployed as part of Corda JVM would at least require more scrutiny due to potential impact on other things running there, which in turn would reduce the agility.
I wonder what approach to integrate with a CorDapp is actually recommended?
Edit 1: I know it is technically possible to embed a webserver into the node and expose a REST API from there, at least in the current version of Corda (4.3 at the time of writing). The question is more about whether it is a good idea to do so, or not, and why.
Take a look at the question I had asked on Stackoverflow regarding front end for CordApp. This might be of some help.
Following is the link -
"Corda: Can we develop Dapps that will be run by IIS webserver to talk to Corda platform?"
You can use any front-end technology you want.
As of Corda 3, your backend must be JVM-based, for two reasons:
You need to load various flow, state and other class definitions onto
the classpath to pass as arguments to flows, retrieve objects from the
vault, etc.
You need to use the CordaRPCClient library to create an
RPC connection to the node
If you really need to write your back-end
in another language, there are a few workarounds:
Create a thin Java webserver that sits between your main webserver and
the node. The Java webserver translates HTTP requests from the main
webserver into RPC calls to the node, and RPC responses from the node
into HTTP responses to the main webserver
This is the approach taken
by libraries such as Braid
Use a library such as GraalVM to compile
non-JVM languages to JVM bytecode
An example of writing a JVM
webserver in Javascript using GraalVM is available here:
https://github.com/nitesh7sid/cordapp-example-nodejs-server-graalvm
I want to use an ETL service, but i am stuck between Apache Airflow and Matillion.
Are they the same?
What are the main differences?
Airflow's primary use-case is orchestration / scheduling, not ETL. You can perform ETL tasks inside Airflow DAGs, but unless you're planning on implementing Airflow using a containerized / K8 architecture, you'll quickly see performance bottlenecks and even hung / stuck processes. There are ways to mitigate this, certainly, but it's not the primary use case.
Matillion's primary use-case is ETL (really ELT), so it's not going to suffer the same performance issues, or require a complex infrastructure to achieve that performance. It also provides a GUI based code-optional interface, so that you don't have to be a Python expert to achieve results quickly.
I actually view Airflow and Matillion as complimentary (potentially). If you have inter-application dependencies, for example, you can orchestrate Matillion workflow with Airflow, or another third-party scheduler, and gain the benefits of both.
I've never used Matillion. So I can't answer with respect to any specific use case you have.
But with the quick analysis on Matillion I can very well tell that Matillion and Airflow aren't the same at all.
Matillion is a Extract/Transform/Load tool. You can compare it with tools like AWS Glue / Apache NiFi / DMExpress.
Airflow is an orchestration tool. You can compare it with tools like oozie.
More importantly Matillion doesn't come free of cost.
I know Airflow is called workflow manager, nifi dataflow manager, but what this means exactly? The best explanation so far was that nifi cares about data while airflow cares about tasks, but I don't quite get this definition, and I couldn't find any other good explanation/article/video that explains how to integrate this systems, if it is a good idea or is better to use each one in their own.
Also I was thinking if it is better StreamSets or NiFi, I think streamsets looks better in UI and monitor the data, but I heard that depends on the case, that nifi is better if I only ingest data, but again I can't find much information about this questions.
As you said, Airflow is a workflow manager. It means that it only tells other programs to run. It doesn't process data, but tells other to run.
NiFi and StreamSets on the other hand, process data, transform it, receive it and send it. That's why they are dataflow managers.