Has anyone reported how much they've been able to get Airflow to scale at their company? I'm looking at implementing Airflow to execute 5,000+ tasks that will each run hourly, and someday scaling that up to 20,000+ tasks. In examining the scheduler, it looks like it might be a bottleneck since only one instance of it can run, and I'm concerned that with that many tasks the scheduler will struggle to keep up. Should I be?
We run thousands of tasks a day at my company and have been using Airflow for the better part of 2 years. These DAGs run every 15 minutes and are generated from config files that can change at any time (fed in from a UI).
The short answer - yes, it can definitely scale to that, depending on your infrastructure. Some of the new 1.10 features should make this easier than it is on the 1.8 version we use to run all those tasks. We ran this on a large Mesos/DCOS cluster, which took a good deal of fine tuning to get to a stable point.
The long answer - although it can scale to that, we've found that a better solution is multiple Airflow instances with different configurations (scheduler settings, number of workers, etc.) optimized for the types of DAGs they are running. A set of DAGs that run long-running machine learning jobs should be hosted on a different Airflow instance from the one running 5-minute ETL jobs. This also makes it easier for different teams to maintain the jobs they are responsible for, and easier to iterate on any fine tuning that's needed.
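To make that concrete, here is a rough sketch (the DAG id, schedule, and numbers are made up for illustration) of the kind of per-DAG knobs you would tune differently on a fast-ETL instance versus a long-running-ML instance:

```python
# Sketch: on a "fast ETL" Airflow instance, keep per-DAG concurrency tight
# so the single scheduler's loop stays responsive. Values are illustrative.
from datetime import datetime, timedelta
from airflow import DAG

dag = DAG(
    dag_id="etl_orders_5min",                # hypothetical DAG name
    start_date=datetime(2018, 1, 1),
    schedule_interval=timedelta(minutes=5),
    concurrency=8,        # max task instances of this DAG running at once
    max_active_runs=1,    # don't let runs pile up if one overruns its interval
    catchup=False,        # don't backfill missed intervals on deploy
)
```

On the ML instance you would flip these the other way: fewer, heavier tasks with more generous limits.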
Related
I have a question regarding the differences between Apache Airflow and Metaflow (https://docs.metaflow.org/). As far as I understand, Apache Airflow is just a job scheduler that runs tasks. Metaflow from Netflix is a dataflow library that creates machine learning pipelines in the form of DAGs. Does that basically mean that Metaflow can be executed on Apache Airflow?
Is my understanding correct?
If so, is it possible to convert a Metaflow DAG into an Apache Airflow DAG?
Honestly, I haven't worked with Metaflow, and thank you for introducing it to me! There is a nice introduction video you can find on YouTube.
Airflow is a framework for creating scheduled pipelines. A pipeline is a set of tasks, linked to each other, that represents a Directed Acyclic Graph. A pipeline can be scheduled: you can say how often or when it should run, and you can say when it should have run in the past and what time period it should backfill. You can run the whole of Airflow as one single Docker container, or you can have a multi-node cluster, and it has a bunch of existing operators to integrate with third-party services. I recommend looking into Airflow's architecture and concepts.
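To make that concrete, here is a minimal sketch of a pipeline (Airflow 1.10-style imports; the task names and commands are just illustrative):

```python
# A minimal Airflow pipeline: two tasks linked into a DAG, scheduled daily.
from datetime import datetime
from airflow import DAG
from airflow.operators.bash_operator import BashOperator

with DAG(dag_id="hello_pipeline",
         start_date=datetime(2019, 1, 1),
         schedule_interval="@daily") as dag:
    extract = BashOperator(task_id="extract", bash_command="echo extracting")
    load = BashOperator(task_id="load", bash_command="echo loading")
    extract >> load  # load runs only after extract succeeds
```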
Metaflow looks like something similar, but created specifically for data scientists. I could be wrong here, but judging by the Metaflow Basics, it looks like you can create a scheduled pipeline in much the same way as in Airflow.
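For comparison, my rough reading of the Metaflow Basics suggests the equivalent looks something like this (I haven't run Metaflow myself, so treat this as a sketch of their documented FlowSpec/@step API):

```python
# Sketch of a Metaflow flow: steps chained with self.next(); Metaflow
# requires explicit "start" and "end" steps.
from metaflow import FlowSpec, step

class HelloFlow(FlowSpec):

    @step
    def start(self):
        self.data = "extracted"   # state is carried between steps
        self.next(self.load)

    @step
    def load(self):
        print("loading", self.data)
        self.next(self.end)

    @step
    def end(self):
        print("done")

if __name__ == "__main__":
    HelloFlow()  # run with: python hello_flow.py run
```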
I would look at the specific tools you want to integrate with and check which of the two integrates better. As mentioned, Airflow has lots of ready-made connectors and operators, as well as a powerful scheduler with backfill and the Jinja template language for designing your DB queries.
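As an example of that templating (the connection id and table names are hypothetical), Airflow renders {{ ds }} to the run's execution date, so the same query definition backfills cleanly:

```python
# Sketch: a templated SQL task. Airflow substitutes {{ ds }} with the
# execution date of each DAG run (import path is the 1.10 one).
from datetime import datetime
from airflow import DAG
from airflow.operators.postgres_operator import PostgresOperator

with DAG(dag_id="daily_orders", start_date=datetime(2019, 1, 1),
         schedule_interval="@daily") as dag:
    load_daily = PostgresOperator(
        task_id="load_daily",
        postgres_conn_id="analytics_db",  # assumed connection id
        sql=("INSERT INTO daily_orders "
             "SELECT * FROM orders WHERE order_date = '{{ ds }}'"),
    )
```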
Hope that is somewhat helpful.
There is also a nice article with a feature comparison.
After some research and testing, we have decided to start using Google Cloud Composer. Since our current DAGs and tasks are relatively small and don't require the server to run continuously, I am looking at how to manage costs.
Two questions:
The option to use preemptible VMs seems logical. This saves costs considerably, and I'm thinking of going for 3x n1-standard-4. I expect each task to be quite short, so I don't think this will have a significant impact on our workloads. Is it possible to use preemptible VMs with Composer?
Scheduling the Composer environment to turn on/off, as asked in this post. I can't find how to do this in the documentation, either by taking the whole environment down or by shutting down the workers as proposed in the answer.
Help, anyone?
This is an interesting question.
One roadblock you may encounter is the nature of Airflow itself. Generally, Airflow is not intended for ephemeral use. Instead, I'd suspect that the vast majority of Airflow deployments, Cloud Composer or otherwise, are persistent. Ephemerality brings cost benefits but also risks with the Airflow architecture. For example, what happens if the job that is supposed to restart your Airflow resources fails?
To answer your questions:
Preemptibles are not supported in Composer. While PVMs have a ton of awesome benefits, they could leave tasks in a very weird state, especially if they got preempted several times.
There is no formal documentation for this process because it's generally informal and not recommended if you depend on your environment. The basic approach, though, would be to:
Create a very small GCE VM
Set up the Cloud SDK (gcloud) to connect to your project
Create a crontab that either does a fresh create/delete of an environment when you need it, or pauses the VMs in the Composer worker pool (a rough sketch of the create/delete option follows below)
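A rough sketch of the create/delete variant, wrapping the gcloud CLI from Python (the environment name, location, and schedule are assumptions you'd adapt):

```python
# Sketch: functions a crontab on the small VM could call to bring a Composer
# environment up before the day's runs and tear it down afterwards.
import subprocess

ENV_NAME = "nightly-etl"   # hypothetical environment name
LOCATION = "us-central1"   # hypothetical region

def create_env():
    subprocess.run(
        ["gcloud", "composer", "environments", "create", ENV_NAME,
         "--location", LOCATION],
        check=True)

def delete_env():
    subprocess.run(
        ["gcloud", "composer", "environments", "delete", ENV_NAME,
         "--location", LOCATION, "--quiet"],
        check=True)
```

Keep in mind a fresh environment starts empty, so the same job would also need to re-upload your DAGs and re-set any connections or variables.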
In the long term, I think Composer will better support ephemeral use of worker resources. In the short term, another option is to run a lightweight Airflow environment on a small(ish) GCE VM and then suspend/resume that VM when you need to use Airflow. You don't get Composer that way, but you do benefit from the team's work improving and expanding GCP support in core Airflow.
We came across the choice of working with a scheduler named Azkaban, which offers a good UI and dependency resolution.
We read through the documentation (https://azkaban.github.io/azkaban/docs/latest/) and started using it.
We're struggling to judge whether it's stable enough for long-running scheduled jobs.
So, what is Azkaban's stability like in terms of:
any challenges faced?
critical issues?
responsiveness of community support for Azkaban?
We have a C#/ASP .Net web application that is built and deployed by the build server (Jenkins). One of the build steps before the automated deployment is ensuring that all automated tests pass -- including functional tests we have using Selenium 2 WebDriver and NUnit.
The problem: sometimes these tests fail randomly. They will succeed for 100 builds and then one just fails. They fail for various reasons -- a .Click() event is just ignored, an element can't be found, IE has a bad day, etc. We have an AJAX-heavy web app, so we depend heavily on WebDriverWaits, but we always take this into account while writing tests, and like I said, the tests do pass most of the time.
What are some ways to avoid or fix this problem? A couple that came to my mind:
Accept a certain number of failures (seems like a bad idea)
Rerun test failures?
I don't like either of the suggestions that you mention, but I admit to having used them occasionally. The best thing to do is to make sure that when there is a seemingly "random" failure, you do everything you can to get all of the data about why it really failed. Was it an environment issue? Did some other process on the machine interfere with the tests? Was it a timing issue that only appears when the site loads excruciatingly slowly, or blazing fast?
One thing that you might try is soak testing your automated tests. Run each one 100+ times on the same build and same environment (so you can rule those out as potential failure points) and find the ones that fail occasionally. See if they fail in the same place or in different places. Generally, when you go through this exercise you'll find some tests that really are a little bit flaky, and you can remove them from the daily run until they are fixed. You could even include a soak as a check-in criterion for any automated test case.
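A soak run doesn't need much tooling; the idea is just a loop that tallies where failures land. Sketched here in Python's unittest with a hypothetical test path (your NUnit setup would drive its own runner the same way):

```python
# Sketch: run one suspect test class many times and tally failure locations.
import unittest
from collections import Counter

RUNS = 100
failures = Counter()

for _ in range(RUNS):
    suite = unittest.defaultTestLoader.loadTestsFromName(
        "tests.test_checkout.FlakyCheckoutTests")  # hypothetical test path
    result = unittest.TextTestRunner(verbosity=0).run(suite)
    for test, tb in result.failures + result.errors:
        # Key on the test plus the final traceback line: a rough "where it died"
        failures[(str(test), tb.splitlines()[-1])] += 1

print(f"{RUNS} runs, {sum(failures.values())} failures")
for (test, reason), count in failures.most_common():
    print(f"{count:3d}x {test}: {reason}")
```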
Another useful thing I have found that helped me get to the bottom of some of the seemingly random failures was taking screenshots on failure. Often you can see that other windows or dialogs had popped up, preventing the browser from being in the foreground, etc.
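The helper itself is tiny. Here's the shape of it using Selenium's Python bindings as a sketch (the .NET bindings expose the same capability through ITakesScreenshot):

```python
# Sketch: wrap a test body so any failure dumps a screenshot before re-raising.
import datetime

def run_with_screenshot(driver, test_fn):
    """Run test_fn(driver); on failure, save a timestamped screenshot."""
    try:
        test_fn(driver)
    except Exception:
        stamp = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
        driver.save_screenshot(f"failure-{test_fn.__name__}-{stamp}.png")
        raise  # keep the original failure visible to the runner
```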
Of the two, I would prefer to rerun test failures, or rather, on test failure, retry the tests.
If you accept a certain number of test failures, then you get into problems about which tests are allowed to fail. You would have to have two sets of tests, some which are allowed to fail, some which are not.
For rerunning, I'm no expert on testing with NUnit, but you could have the tests themselves manage the retry. In JUnit, you can introduce a rule so that if a test fails, it is retried a maximum of 3 times. This would probably avoid most of the problems you're having. I don't know how to do this in NUnit, but see my answer to How to Re-run failed JUnit tests immediately?, which will give you the general idea.
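The general shape of that rule, sketched as a Python decorator (the retry count is arbitrary, and NUnit would need its own mechanism, which I can't vouch for):

```python
# Sketch: re-run a flaky test up to `times` attempts before reporting failure.
import functools

def retry(times=3):
    def decorator(test_fn):
        @functools.wraps(test_fn)
        def wrapper(*args, **kwargs):
            last_error = None
            for attempt in range(1, times + 1):
                try:
                    return test_fn(*args, **kwargs)
                except Exception as e:  # assertion or WebDriver errors alike
                    last_error = e
                    print(f"{test_fn.__name__}: attempt {attempt}/{times} failed")
            raise last_error
        return wrapper
    return decorator
```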
I'm working at a company that uses Continuous Integration (TeamCity). Every time someone does a check-in, the CI software starts a build and runs all the unit/automated tests. The problem is that we have more than 7,000 unit tests plus 756 automated tests (used to test the JavaScript, as we have very complex UI logic for calculations etc.). As you can imagine, every time someone checks in, the whole process takes more than 2 hours to go through all the steps (build, unit tests, automated tests), so I have to wait that long before I get a result telling me whether my check-in has broken an automated test or a unit test. The worst situation is when more than one person checks something in: TeamCity starts queuing up the builds, and before I can get a valid (up-to-date) result I can wait up to half a day! What strategy should we adopt to speed this process up a bit? Is it best practice to run all the automated tests against even a small change?
I would look at breaking up your test suite in two ways, with the goal of letting you and your team check in, go get a cup of coffee, and have some meaningful feedback from TeamCity by the time you get back to your desk:
Decide what you really want to test on every commit, and move the remaining tests to a suite that runs at a scheduled interval (hourly, nightly - whatever works for you).
If the set of tests agreed upon to run on every commit is still large, break that set up and distribute it across multiple nodes running in parallel.
You may also want to beef up your CI machine; depending on the nature of your builds, have the working directory for the tests live in tmpfs (a RAM disk).
I'm going to talk in theory; I have yet to put it into practice, but CI is one of my goals to have up and humming by the end of the summer.
From statements made by the developers who have earned the most respect from me, the most common element of a CI testing strategy is to split your tests into long-running and short-running suites.
Then you would configure standard check-ins to kick off the short-running tests for basic validation of the solution. Nightly builds and deployment builds are then the only times you NEED to run the full test suite to validate against regressions.
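One way to express that split, sketched with pytest markers since that's the tooling I know (NUnit's [Category] attribute plus separate TeamCity build configurations would be the equivalent in a .NET stack; the marker name is my own):

```python
# Sketch: tag long-running tests so the per-commit build can exclude them.
# Per commit:  pytest -m "not slow"
# Nightly:     pytest
import pytest

@pytest.mark.slow
def test_full_regression_flow():
    ...  # long-running end-to-end check, nightly only

def test_basic_validation():
    ...  # fast check, runs on every check-in
```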
Aside/alternate answer: seeing as I haven't set up CI for myself yet, I had never understood TeamCity's business model of basing pricing on build agents. Now I understand why multiple build agents really start to matter when your test suite takes that long: being able to run 5 builds at once becomes much more important. So one option could be to just spend more money and stick a band-aid on the bullet hole for now.
Continuous Integration works best with a distributed version control system like Git or Mercurial.
Every developer can check in often into their local repository without triggering the whole integration and UI testing ceremony all the time.
Once a feature is finished locally, it can be checked in to the central repository. Thus the CI server runs all the time-consuming tests only when new features and/or fixes have been added.
Have you considered using pre-tested commits? If you run a remote run build (without committing to the VCS), you can be sure that you didn't break anything in the VCS (simply because you haven't committed yet), and you can continue working without problems. If the build is successful, you can commit your change (even if you have made other changes in the same files since); if you commit via TeamCity's plugin, it will commit exactly the code you sent to the server for testing.
This way, you don't have to wait until TeamCity's build has finished to continue working with it.
Disclaimer: I'm a TeamCity developer.