I want to use an ETL service, but I am stuck between Apache Airflow and Matillion.
Are they the same?
What are the main differences?
Airflow's primary use case is orchestration/scheduling, not ETL. You can perform ETL tasks inside Airflow DAGs, but unless you're planning to run Airflow on a containerized/Kubernetes architecture, you'll quickly see performance bottlenecks and even hung/stuck processes. There are ways to mitigate this, certainly, but it's not the primary use case.
Matillion's primary use case is ETL (really ELT), so it's not going to suffer the same performance issues, or require a complex infrastructure to achieve that performance. It also provides a GUI-based, code-optional interface, so you don't have to be a Python expert to achieve results quickly.
I actually view Airflow and Matillion as (potentially) complementary. If you have inter-application dependencies, for example, you can orchestrate Matillion workflows with Airflow, or another third-party scheduler, and gain the benefits of both.
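To make that concrete, here's a minimal sketch of the split, assuming a Matillion instance that exposes a job-launch REST endpoint. The URL, job name, and credentials are placeholders for illustration, not Matillion's actual API, so adapt them to your deployment. Airflow only schedules and tracks the run; Matillion does the ELT work:

```python
# Hedged sketch: Airflow as orchestrator, Matillion as the ELT engine.
# The endpoint URL and credentials below are placeholders for illustration.
from datetime import datetime

import requests
from airflow import DAG
from airflow.operators.python import PythonOperator


def trigger_matillion_job():
    """Ask the Matillion instance to run an ELT job; Airflow only orchestrates."""
    resp = requests.post(
        "https://matillion.example.com/rest/v1/.../job/name/load_orders/run",  # placeholder
        auth=("api-user", "api-password"),  # placeholder credentials
        timeout=30,
    )
    resp.raise_for_status()  # fail the Airflow task if Matillion rejects the request


with DAG(
    dag_id="orchestrate_matillion",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    PythonOperator(
        task_id="run_matillion_elt",
        python_callable=trigger_matillion_job,
    )
```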
I've never used Matillion, so I can't answer with respect to any specific use case you have.
But from a quick analysis of Matillion, I can tell that Matillion and Airflow aren't the same at all.
Matillion is an Extract/Transform/Load tool. You can compare it with tools like AWS Glue, Apache NiFi, or DMExpress.
Airflow is an orchestration tool. You can compare it with tools like Oozie.
More importantly, Matillion doesn't come free of cost.
I continue to read that Airflow is meant to be an orchestration engine and not an execution engine. But what does that mean?
I know that Kubernetes is also called an orchestration tool. So is the idea that Airflow is simply meant to define high-level dependency graphs but not any of the code to actually perform said tasks? I am still really struggling to understand the paradigm that Airflow is operating with.
I am working on a project that grabs a set of input data from AWS S3, pre-processes and divvies it up, spins up 10K batch containers to process the divvied data in parallel on AWS Batch, post-aggregates the data, and pushes it to S3.
I already have software patterns from other projects for Airflow + Batch, but have not dealt with the scaling factors of 10k parallel tasks. Airflow is nice since I can look at which tasks failed and retry a task after debugging. But dealing with that many tasks on one Airflow EC2 instance seems like a barrier. The other option would be to have one task that kicks off the 10k containers and monitors them from there.
I have no experience with Step Functions, but have heard it's AWS's Airflow. There seem to be plenty of patterns online for Step Functions + Batch. Does Step Functions seem like a good path to check out for my use case? Do you get the same insight into failing jobs / ability to retry tasks as you do with Airflow?
I have worked on both Apache Airflow and AWS Step Functions and here are some insights:
Step Functions is fully managed out of the box. It has the high availability and scalability your use case requires; with Airflow you would have to achieve that yourself with auto-scaling/load balancing on servers or containers (Kubernetes).*
Both Airflow and Step Functions have user-friendly UIs. While Airflow supports multiple representations of a workflow, Step Functions only displays the state machine as a DAG.
As of version 2.0, Airflow's REST API is now stable. AWS Step Functions is also supported by a range of production-grade CLIs and SDKs.
Airflow has server costs, while Step Functions has 4,000 free state transitions per month (free tier) and costs $0.000025 per step after that. E.g., if you use 10K steps for AWS Batch running once daily, it will cost you $0.25 per day ($7.50 per month). The price of an Airflow server (t2.large EC2, 1-year reserved instance) is $41.98 per month. We would have to use AWS Batch in either case.**
AWS Batch can integrate with both Airflow and Step Functions.
You can clear and rerun a failed task in Apache Airflow, but in Step Functions you will have to create a custom implementation to handle that. You can handle automated retries with back-offs in the Step Functions definition as well.
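For example, in Airflow retries with back-off are declared per task; here is a minimal sketch (the DAG and task names are made up), and a failed run can additionally be cleared and re-run from the UI after debugging:

```python
# Sketch of Airflow's built-in per-task retry handling; names are made up.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="retry_demo",
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,
) as dag:
    BashOperator(
        task_id="flaky_step",
        bash_command="exit 1",  # always fails, just to watch retries happen
        retries=3,  # re-attempt up to 3 times before marking the task failed
        retry_delay=timedelta(minutes=5),  # base wait between attempts
        retry_exponential_backoff=True,  # roughly doubles the wait each retry
    )
```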
For a failed task in Step Functions, you will get a visual representation of the failed state and a detailed message when you click it. You can also use the AWS CLI or SDKs to get the details.
Step Functions uses easy-to-read JSON (the Amazon States Language) as its state machine definition, while Airflow workflows are defined in Python scripts.
Step Functions supports async callbacks, i.e. the state machine pauses until an external source notifies it to resume, while Airflow has yet to add this feature.
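The callback pattern works by having a state started with the `.waitForTaskToken` integration hand out a task token and pause; whoever holds the token resumes or fails the execution. A sketch of the external worker's side using boto3 (the token itself is supplied to your worker at runtime):

```python
# Sketch of resuming a paused Step Functions execution from an external worker.
import json

import boto3

sfn = boto3.client("stepfunctions")


def resume_execution(task_token: str, result: dict) -> None:
    """Notify the paused state machine that the external work succeeded."""
    sfn.send_task_success(taskToken=task_token, output=json.dumps(result))


def fail_execution(task_token: str, reason: str) -> None:
    """Tell the paused state machine that the external work failed."""
    sfn.send_task_failure(
        taskToken=task_token,
        error="ExternalWorkError",  # illustrative error name
        cause=reason,
    )
```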
Overall, I see more advantages to using AWS Step Functions. You will have to consider maintenance and development costs for both services as per your use case.
UPDATES (AWS Managed Workflows for Apache Airflow Service):
*With the AWS Managed Workflows for Apache Airflow service, you can offload deployment, maintenance, autoscaling/load balancing and security of your Airflow service to AWS. But please consider the version number you're willing to settle for, as AWS managed services are usually behind the latest version (e.g., as of March 08, 2021, the latest version of open-source Airflow is 2.0.1, while MWAA only offers version 1.10.12).
**MWAA charges for the environment, instances, and storage.
I have used both Airflow and Step Functions in my personal and work projects.
In general I liked Step Functions, but the fact that you need to schedule executions with EventBridge is super annoying. Actually, I think Airflow could just act as a trigger for the Step Functions here.
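For illustration, a minimal sketch of kicking off a state machine from code (e.g. from an Airflow task) instead of an EventBridge rule; the ARN and input are placeholders:

```python
# Sketch: start a Step Functions execution programmatically via boto3.
import json

import boto3

sfn = boto3.client("stepfunctions")

response = sfn.start_execution(
    stateMachineArn="arn:aws:states:us-east-1:123456789012:stateMachine:my-pipeline",  # placeholder
    name="run-2021-03-08",  # optional; must be unique per execution
    input=json.dumps({"run_date": "2021-03-08"}),
)
print(response["executionArn"])  # handle for polling or inspecting the run
```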
If Airflow were cheaper to manage, I would always opt for it, because I find managing JSON-based pipelines a hassle whenever I need to detour from the main use case, which somehow always happens for me. This becomes an even more complex issue when you need source control.
This is a more subjective assessment, but I find the monitoring capabilities of Airflow far greater than those of Step Functions.
Also, some information about the usage of Airflow vs Step Functions:
AWS now offers managed Airflow (MWAA), which is priced per hour, so you don't need a dedicated EC2 instance. On the other hand, Step Functions workflows typically run on AWS Lambda functions, which have an execution time limit of 15 minutes, making them a poor candidate for long-running pipelines.
I am confused, given my limited knowledge of Redis, ZooKeeper, and Solr.
Help me understand the network architecture of Redis Sentinel and ZooKeeper.
At a high level, both Redis Sentinel and ZooKeeper look functionally similar: choosing masters and slaves, and monitoring.
Redis Cluster was introduced with a different architecture, where separate servers for monitoring are not required. This is also mentioned as a con of Sentinel.
The Solr documentation says that in production it is good to set up an external ZooKeeper ensemble to maintain Solr.
Can someone explain to me, at the network protocol/architecture level, why one approach is good and the other is not?
--Updated
My question is not specific to Redis Sentinel or Solr. Rather, it is about the architecture.
In Redis, keeping Sentinel outside was not really helping. It created unnecessary overhead, as Sentinel also needs to be maintained on separate servers.
So they came up with Redis Cluster, where no external servers are required for monitoring and choosing masters/slaves.
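To make the difference concrete from a client's point of view, here is a rough sketch using the redis-py library (version 4+; all hostnames are placeholders). With Sentinel, the client first asks the separate monitoring processes who the current master is; with Redis Cluster, the data nodes themselves carry the topology:

```python
# Hedged sketch of the client-side difference; hosts are placeholders.
from redis.sentinel import Sentinel
from redis.cluster import RedisCluster

# Sentinel: query the external monitoring tier to discover the master.
sentinel = Sentinel(
    [("sentinel-1", 26379), ("sentinel-2", 26379)], socket_timeout=0.5
)
master = sentinel.master_for("mymaster", socket_timeout=0.5)
master.set("key", "value")  # writes go to whichever node Sentinel elected

# Cluster: connect to any data node; topology is shared among the nodes.
cluster = RedisCluster(host="cluster-node-1", port=6379)
cluster.set("key", "value")  # client gets redirected to the right shard
```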
In the case of Solr, though it has an internal ZooKeeper, the documentation suggests keeping an external ZooKeeper in production as a best practice.
In the above cases, it looks to me like the two recommended best practices are architecturally opposite.
Please help me understand, at an architecture level, how this helps in the Solr use case but not in the Redis one.
I know Airflow is called a workflow manager and NiFi a dataflow manager, but what does this mean exactly? The best explanation so far was that NiFi cares about data while Airflow cares about tasks, but I don't quite get this definition, and I couldn't find any other good explanation/article/video that explains how to integrate these systems, whether that's a good idea, or whether it's better to use each one on its own.
I was also wondering whether StreamSets or NiFi is better. I think StreamSets looks better in terms of UI and data monitoring, but I've heard it depends on the use case, and that NiFi is better if I only need to ingest data; again, I can't find much information about these questions.
As you said, Airflow is a workflow manager. That means it only tells other programs to run; it doesn't process data itself.
NiFi and StreamSets, on the other hand, receive, transform, and send the data themselves. That's why they are dataflow managers.
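A minimal sketch of that idea in Airflow (the commands and names are made up): each task just launches an external program, while Airflow itself only tracks ordering, scheduling, and success/failure:

```python
# Sketch: Airflow delegates the actual work to external programs.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="delegate_to_other_tools",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    ingest = BashOperator(
        task_id="ingest",
        bash_command="curl -s -X POST https://nifi.example.com/run-flow",  # placeholder
    )
    transform = BashOperator(
        task_id="transform",
        bash_command="spark-submit /jobs/transform.py",  # placeholder
    )
    ingest >> transform  # Airflow enforces order; the work happens elsewhere
```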
We are using the Nitrogen-SR3 version of OpenDaylight. We want to support more NEs, and during our testing we observed missing shards. While analyzing this issue we noticed the shards go missing because LevelDB is not acknowledging writes when it is very busy. We came across the Cassandra plugin for Akka persistence; would it be a good idea to use Cassandra instead of LevelDB so that we can scale better?
Please advise us whether there are any production deployments with the Cassandra plugin for Akka persistence.
Of course you can use whatever plugin suits your environment and needs. I'm not aware of anyone using Cassandra. LevelDB is suitable as a default as it's simple and doesn't require any external server. It seems to work fine for most use cases, even though Akka doesn't recommend it for production.
I assume you're probably hitting the (dreaded) circuit-breaker timeout in Akka when the plugin doesn't respond in time, which can happen with a slow disk or under saturation. The default timeout is 5 seconds, but it is configurable (check the Akka persistence docs).