I have a question regarding the differences between Apache Airflow and Metaflow (https://docs.metaflow.org/). As far as I understand, Apache Airflow is just a job scheduler that runs tasks. Metaflow from Netflix is a dataflow library that creates machine learning pipelines in the form of DAGs. Does this basically mean that Metaflow can be executed on Apache Airflow?
Is my understanding correct?
If yes, is it possible to convert a Metaflow DAG into an Apache Airflow DAG?
Honestly, I haven't worked with Metaflow before, so thank you for introducing it to me! There is a nice introduction video you can find on YouTube.
Airflow is a framework for creating scheduled pipelines. A pipeline is a set of tasks, linked to each other, that represents a Directed Acyclic Graph. A pipeline can be scheduled: you can tell it how often or when it should run, but also when it should have run in the past and what time period it should backfill. You can run the whole of Airflow as one single Docker container, or you can have a multi-node cluster. It also comes with a bunch of existing operators to integrate with third-party services. I recommend looking into the Airflow architecture and concepts.
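To make that concrete, here is a minimal sketch of an Airflow DAG (the DAG id, schedule, and task names are made up for illustration; the imports use the Airflow 1.x module paths):

```python
# Minimal Airflow DAG sketch; names and schedule are illustrative only.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator  # Airflow 1.x import path

default_args = {
    "owner": "airflow",
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
}

dag = DAG(
    dag_id="example_pipeline",        # hypothetical name
    default_args=default_args,
    start_date=datetime(2019, 1, 1),  # where backfill begins
    schedule_interval="@daily",       # how often the DAG runs
    catchup=True,                     # also run the missed past intervals
)

extract = BashOperator(task_id="extract", bash_command="echo extract", dag=dag)
load = BashOperator(task_id="load", bash_command="echo load", dag=dag)

extract >> load  # an edge of the Directed Acyclic Graph
```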
Metaflow looks like something similar, but created specifically for data scientists. I could be wrong here, but looking at the Metaflow Basics it looks like I can create a scheduled pipeline in much the same way as in Airflow.
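For comparison, here is a minimal sketch of a Metaflow flow, going by the Metaflow Basics docs (the flow and artifact names are made up):

```python
# Minimal Metaflow flow sketch; flow and artifact names are illustrative.
from metaflow import FlowSpec, step


class ExampleFlow(FlowSpec):

    @step
    def start(self):
        self.data = [1, 2, 3]  # artifacts assigned to self persist between steps
        self.next(self.transform)

    @step
    def transform(self):
        self.data = [x * 2 for x in self.data]
        self.next(self.end)

    @step
    def end(self):
        print("result:", self.data)


if __name__ == "__main__":
    ExampleFlow()
```

Running `python example_flow.py run` executes the steps in order, so the DAG structure is expressed directly in Python, much like an Airflow DAG file.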
I would look at the specific tools you want to integrate with and check which of the two integrates better. As mentioned, Airflow has lots of ready-made connectors and operators, as well as a powerful scheduler with backfill and the Jinja template language to design your DB queries.
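Building on the DAG sketch above, here is roughly what that Jinja templating looks like in a SQL task (the connection id, table, and query are hypothetical):

```python
# Sketch of Airflow's Jinja templating in SQL; connection id and table are
# hypothetical. {{ ds }} is Airflow's built-in macro for the execution date.
from airflow.operators.postgres_operator import PostgresOperator  # Airflow 1.x path

load_partition = PostgresOperator(
    task_id="load_partition",
    postgres_conn_id="my_warehouse",  # assumed connection id
    sql="""
        INSERT INTO daily_summary
        SELECT * FROM events
        WHERE event_date = '{{ ds }}';  -- filled in by Airflow at run time
    """,
    dag=dag,  # the DAG object from the sketch above
)
```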
Hope that is somewhat helpful.
There is also a nice article with a feature comparison.
So let's say I want to use other monitoring platforms, such as ManageEngine and DataDog, with Apache Airflow. Two ways we could maybe do this: one, have Airflow communicate with them directly, and two, have Airflow write to a table that the other platforms can read. How would I implement both of these approaches so that Airflow can tell these monitoring tools when jobs start/stop?
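For the "direct communication" approach, one possible shape is Airflow's task success/failure callbacks posting to the monitoring tool's HTTP API. This is only a sketch; the endpoint URL and payload are hypothetical, and DataDog's and ManageEngine's docs would have their real ingestion APIs:

```python
# Sketch: task callbacks that notify an external monitoring endpoint.
# The URL and payload are hypothetical placeholders.
from datetime import datetime

import requests
from airflow import DAG
from airflow.operators.bash_operator import BashOperator


def notify_monitoring(context):
    # Airflow passes the task context dict to success/failure callbacks
    ti = context["task_instance"]
    requests.post(
        "https://monitoring.example.com/events",  # hypothetical endpoint
        json={
            "dag": ti.dag_id,
            "task": ti.task_id,
            "state": ti.state,
            "execution_date": str(context["execution_date"]),
        },
    )


dag = DAG("monitored_dag", start_date=datetime(2019, 1, 1), schedule_interval="@daily")

job = BashOperator(
    task_id="job",
    bash_command="echo run",
    on_success_callback=notify_monitoring,  # fires after the task succeeds
    on_failure_callback=notify_monitoring,  # fires after the task fails
    dag=dag,
)
```

The "write a table" approach would be the same idea, with the callback doing an INSERT through a database hook instead of an HTTP POST.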
After some research and testing, we have decided to start using Google Cloud Composer. Since our current DAGs and tasks are relatively small and don't require the server to run continuously, I am looking at how to manage costs.
Two questions:
The option to use preemptible VMs seems logical. This saves costs considerably, and I'm thinking of going for 3x n1-standard-4. I expect each task to be quite short, so I don't think this will have a significant impact on our workloads. Is it possible to use preemptible VMs with Composer?
Scheduling the Composer environment to turn on/off, as asked in this post. I can't find how to do this in the documentation, either by shutting down the whole environment or by shutting down the workers as proposed in the answer.
Help, anyone?
This is an interesting question.
One roadblock you may encounter is the nature of Airflow itself. Generally, Airflow is not intended for ephemeral use. Instead, I'd suspect that the vast majority of Airflow deployments, Cloud Composer or otherwise, are persistent. Ephemerality brings cost benefits, but also risks, given the Airflow architecture. For example, what happens if the job that is supposed to restart your Airflow resources fails?
To answer your questions:
Preemptible VMs are not supported in Composer. While PVMs have a ton of awesome benefits, they could leave tasks in a very weird state, especially if you got preempted several times.
There is no formal documentation for this process because it's generally informal and not recommended if you must be able to depend on your environment. The basic approach, though, would be to:
Create a very small GCE VM
Set up the Cloud SDK (gcloud) to connect to your project
Create a crontab that either does a fresh create/delete of an environment when you need it, or pauses the VMs in the Composer worker pool (a rough sketch of such a script follows)
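As a rough sketch of what the cron-driven script could look like, here is the create/delete variant calling the Cloud SDK from Python; the environment name, location, and node count are placeholders, and you should verify the gcloud flags against your SDK version:

```python
# Sketch: cron-driven create/delete of a Composer environment via gcloud.
# Environment name, location, and node count are hypothetical placeholders.
import subprocess
import sys

ENV = "my-ephemeral-composer"  # hypothetical environment name
LOCATION = "us-central1"       # hypothetical region


def gcloud_composer(*args):
    # Shells out to the Cloud SDK; raises if the command fails
    subprocess.check_call(["gcloud", "composer", "environments", *args])


if __name__ == "__main__":
    action = sys.argv[1]  # cron passes "create" in the morning, "delete" at night
    if action == "create":
        gcloud_composer("create", ENV, "--location", LOCATION, "--node-count", "3")
    elif action == "delete":
        gcloud_composer("delete", ENV, "--location", LOCATION, "--quiet")
```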
In the long term, I think Composer will better support ephemeral use of worker resources. In the short term, another option is to run a lightweight Airflow environment on a small(ish) GCE VM and then suspend/resume that VM when you need to use Airflow. You don't get Composer that way, but you do benefit from the team's work on improving and expanding GCP support in core Airflow.
Has anyone reported how far they've been able to scale Airflow at their company? I'm looking at implementing Airflow to execute 5,000+ tasks that will each run hourly, and someday scaling that up to 20,000+ tasks. Examining the scheduler, it looks like it might be a bottleneck, since only one instance of it can run, and I'm concerned that with that many tasks the scheduler will struggle to keep up. Should I be?
We run thousands of tasks a day at my company and have been using Airflow for the better part of 2 years. These DAGs run every 15 minutes and are generated from config files that can change at any time (fed in from a UI).
The short answer: yes, it can definitely scale to that, depending on your infrastructure. Some of the new 1.10 features should make this easier than on the 1.8 version we run all those tasks on. We ran this on a large Mesos/DC/OS cluster that took a good deal of fine-tuning to get to a stable point.
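For reference, here is a sketch of the airflow.cfg settings that tend to matter most at this scale (these are the Airflow 1.10 option names; the values are illustrative placeholders, not tuned recommendations):

```ini
# Sketch of scale-related airflow.cfg knobs (Airflow 1.10 option names);
# values are illustrative placeholders, not tuned recommendations.
[core]
# max task instances running across the whole installation
parallelism = 1024
# max task instances running per DAG
dag_concurrency = 128
# max concurrent DAG runs per DAG
max_active_runs_per_dag = 4

[scheduler]
# scheduler processes parsing DAG files in parallel
max_threads = 8

[celery]
# task slots per Celery worker
worker_concurrency = 32
```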
The long answer: although it can scale to that, we've found that a better solution is multiple Airflow instances with different configurations (scheduler settings, number of workers, etc.) optimized for the types of DAGs they are running. A set of DAGs that run long machine learning jobs should be hosted on a different Airflow instance from the ones running 5-minute ETL jobs. This also makes it easier for different teams to maintain the jobs they are responsible for, and easier to iterate on any fine-tuning that's needed.
We came across the choice of working with a scheduler named Azkaban, which offers good UI benefits and dependency resolution.
We read through the documentation at https://azkaban.github.io/azkaban/docs/latest/ and started using it.
We are struggling to consider it stable for long-running scheduled jobs.
So, how does Azkaban look from a stability point of view, in terms of:
Any challenges faced?
Critical issues?
Responsiveness of community support for Azkaban?
I'm new to Cloudify (4.2) and trying to exercise scheduling workflows. On the Cloudify Roadmap I found this feature:
Scheduled Workflow Execution: The ability to schedule a workflow execution at a future time, such as scaling the number of web server VMs at a certain time of the day
But unfortunately I can't find it in the documentation, nor a small example of how to do this or which services and policies I should use.
Any hints and suggestions will be greatly appreciated.
Keep in mind that this feature is in development, which is why you found it on the Roadmap page and why it cannot be found anywhere in the docs yet. We don't yet know which upcoming release it will be added to.
If you have any other questions, please head over and ask on the User Group.