Pipeline Dependency Graph - Airflow

I am looking to create a dependency graph for a few pipelines in my cluster, showing the start and end points of my data and all the flows of data in between. I am looking to use either Apache Airflow or Apache Falcon to accomplish this. Please let me know if you have any suggestions on which tool to use and how to get started with this project, and please also link any documentation related to Apache Falcon. Thank you.

Apache Airflow has a nice UI which lets you view your DAGs as a tree, a graph, or even a Gantt chart. Check this out for more information.
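To make that concrete, here is a minimal sketch of a DAG (task names are placeholders) whose fan-out/fan-in structure the Graph view renders as a dependency graph between your data's start and end points:

```python
# Minimal sketch; task names are placeholders, and the DummyOperator import
# path matches the Airflow 1.10/2.0-era layout.
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

with DAG(
    dag_id="pipeline_dependency_graph",
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,
) as dag:
    source = DummyOperator(task_id="data_source")
    transform_a = DummyOperator(task_id="transform_a")
    transform_b = DummyOperator(task_id="transform_b")
    sink = DummyOperator(task_id="data_sink")

    # Fan out from the source and fan back in at the sink; the Graph view
    # draws exactly this dependency structure.
    source >> [transform_a, transform_b] >> sink
```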

You can also create your own UI views in Airflow, allowing you to display whatever you're looking for. See the Airflow plugins documentation, and have a look at the Airflow plugins GitHub for examples.
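As a rough sketch of what a custom view can look like (this uses the Airflow 1.x-era plugin API; the class names and template file are placeholders):

```python
# Rough sketch of a custom UI view, using the Airflow 1.x-era plugin API.
# The view class, plugin name and template file are all placeholders.
from airflow.plugins_manager import AirflowPlugin
from flask_admin import BaseView, expose

class DependencyGraphView(BaseView):
    @expose("/")
    def index(self):
        # Render your own template visualizing the data flows here.
        return self.render("dependency_graph.html")

class DependencyGraphPlugin(AirflowPlugin):
    name = "dependency_graph_plugin"
    admin_views = [DependencyGraphView(name="Dependency Graph", category="Plugins")]
```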

Related

Datadog cannot find check in "catalog" when implementing integration

I've been trying to implement a Datadog integration, more specifically Airflow's. I've been following this piece of documentation using the containerized approach (I've tried using pod annotations and adding the confd parameters to the agent's Helm values). I've only made progress when adding the airflow.yaml config to the confd section of the cluster agent. However, I get stuck when I try to validate the integration as specified in the documentation by running datadog-cluster-agent status. Under the "Running Checks" section, I can see the following:
airflow
-------
Core Check Loader:
Check airflow not found in Catalog
On top of being extremely generic, this error message mentions a "Catalog" that is not referenced anywhere else in the Datadog documentation. It doesn't give me any hints on what could possibly be wrong with the integration. Has anyone had the same problem and knows how to solve it, or at least how I can get more info/details/verbosity to debug this issue?
You may need to add cluster_check: true to your airflow.yaml confd configuration.
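For reference, a sketch of what that confd entry might look like (the webserver URL is a placeholder for your cluster; check the integration docs for the full set of instance options):

```yaml
# airflow.yaml under the Cluster Agent's confd section - a sketch, with a
# placeholder URL for your Airflow webserver service.
cluster_check: true
init_config:
instances:
  - url: "http://airflow-webserver.airflow.svc.cluster.local:8080"
```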

Airflow: How to prevent users from editing Variables/Connections in the GUI in a Production Environment

We want to prevent users from manually editing or adding variables/connections via the Airflow GUI in production. Currently, we are using JSON files which load all the connections and variables for Airflow.
Can experts guide me on how to achieve this?
Sergiy's right: you will need to start looking at Airflow's Role-Based Access Control (RBAC) documentation and implement from there. Here's a place to start: https://airflow.readthedocs.io/en/latest/howto/add-new-role.html
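As a sketch of the mechanics with the Airflow 2.x CLI (the role and user names are placeholders; which permissions the restricted role gets is up to you and is covered in the linked docs):

```bash
# Create a restricted role, then move production users onto it
# instead of Admin/Op. Names below are placeholders.
airflow roles create ReadOnlyOps

# Assign an existing user to the restricted role.
airflow users add-role --username jdoe --role ReadOnlyOps
```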

Airflow DAG Versioning

Is DAG versioning a thing? I can't find much on the subject with a few Google searches. I would like to look at the DAGs screen in Airflow and be sure of what DAG code is in the wild.
The simplest solution would be to include a version number as part of the dag_id, but I would appreciate knowing if anyone has a better, alternative solution. Tags would work too and might look good in the UI; they are designed for filtering, though, and I'm not sure if there would be undesirable side effects.
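For illustration, the dag_id workaround might look like this (the pipeline name and version scheme are placeholders):

```python
# Sketch of the dag_id-versioning workaround: bake the version into the
# dag_id (and a tag) so the UI makes the deployed code explicit.
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

DAG_VERSION = "1.2.0"  # placeholder; e.g. derived from your release process

with DAG(
    dag_id=f"my_pipeline_v{DAG_VERSION.replace('.', '_')}",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    tags=[f"version:{DAG_VERSION}"],  # tags are filterable in the UI
) as dag:
    DummyOperator(task_id="placeholder_task")
```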
As the author of the DAG Versioning AIP, I can say that this work has been deferred post 2.0 mainly to support end-to-end DAG Versioning.
Originally, we (Airflow Core Committers) were planning to have a Webserver-only DAG Versioning i.e. to improve the visibility behaviour but not execution:
The scope of this AIP is to make sure that the visibility behaviour of Airflow is correct, without changing the execution behaviour, which will continue to be based on the most recent version of the DAG.
This means it overcomes the issue where you go back to an old version of the DAG to view its shape a few months back, and you see the correct representation instead of the "always-latest" one.
Currently, Airflow suffers from the issue where if you add/remove a task, it gets added/removed in all the previous DagRuns in the Webserver.
However, what we have decided is that we will accomplish the Remote DAG Fetcher + DAG Versioning and enable versioning of DAGs on the worker side, so a user will be able to run a previous version of a DAG too.
Currently, we don't have a date, but we are mostly planning to do it around the end of 2021.
The Airflow project has a draft feature open to support DAG versions. The answer currently is that Airflow does not support versions.
The first use case in the link describes a key limitation: log files from previous runs can only surface nodes from the current DAG.
As mentioned above, as of yet Airflow doesn't have its own functionality for versioning workflows. However, you can manage that on your own by keeping DAGs in their own git repository and fetching its state into the Airflow repository as a submodule. More on that:
https://www.youtube.com/watch?v=a-4yRne3ba4&lc=UgwiIO-ECVFSZPz1hOt4AaABAg

Writing an appspec.yml File for Deployment from S3 (and/or Bitbucket) to AWS CodeDeploy

I'd like to make it so that a commit to our Bitbucket repo (or S3 bucket) automatically deploys code (using CodeDeploy) to our EC2 instances. I'm not clear on what to use for the 'source' and 'destination' entries under the 'files' section in the appspec.yml file, and I am also not clear on what to put in BeforeInstall and AfterInstall under the 'hooks' section. I've found some examples on Google and in the AWS documentation, but I am confused about what to put in the above fields. The more I explore, the more confused I get.
Please consider that I am new to AWS CodeDeploy.
Also, it would be very helpful if someone could provide a step-by-step link on how to configure and automate CodeDeploy.
I was wondering if someone could help me out?
Thanks in advance for your help!
Thanks for using CodeDeploy. For new users, I'd like to recommend the following:
Try running the First Run Wizard on the console; it will show you the general process of how a deployment goes. It also provides a default deployment bundle, with an appspec file included.
Once you want to try a deployment yourself, the Get Started doc is a great place for help with prerequisite settings like the IAM roles.
Then probably try some tutorials for a sample app too, which give you some idea about deployment groups, deployment configurations, revisions and so on.
The next step should be to create a bundle for your own use case; the AppSpec file doc is a great place to refer to. As for your concerns about BeforeInstall and AfterInstall: if your application doesn't need to do anything there, the lifecycle events can be left empty. BeforeInstall can be used for pre-install tasks, such as decrypting files and creating a backup of the current version, while AfterInstall can be used for tasks such as configuring your application or changing file permissions.
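To make the files/hooks sections concrete, here is a sketch of an appspec.yml for an EC2 deployment (paths and script names are placeholders for your application):

```yaml
# Sketch of an appspec.yml for an EC2/on-premises deployment.
# Paths, scripts and the runas user are placeholders.
version: 0.0
os: linux
files:
  # 'source' is relative to the root of the revision bundle;
  # 'destination' is an absolute path on the instance.
  - source: /
    destination: /var/www/myapp
hooks:
  BeforeInstall:
    # e.g. back up the currently deployed version
    - location: scripts/backup_current_version.sh
      timeout: 300
      runas: root
  AfterInstall:
    # e.g. set file permissions or write application config
    - location: scripts/configure_app.sh
      timeout: 300
      runas: root
```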
Now it comes to the fun part! This blog talks about the details of how to integrate with GitHub (similar for Bitbucket). It's a little long, but really useful, and it also covers how to deploy automatically once there is a new pushed commit. Currently, Jenkins and CodePipeline are really popular for auto-triggered deployments, but there are always a lot of other ways to achieve the same purpose, like Lambda and so on.

Using the simpletest automator in Drupal 6

I've been trying to learn how to use simpletest, and I found the simpletest automator. I was able to install it and run it, but where is the file with the results of the 'macro' saved? I haven't been able to find it.
Also, is there a quick way to duplicate a drupal install in simpletest? I know it starts from a clean install, but I don't want to have to go through and figure out what all is enabled and who has what permissions at the start of the test. Is there a script that can figure out the settings of the current drupal install?
Thank you.
Is there a script that can figure out the settings of the current drupal install?
The short answer is no.
Essentially, simpletest should be used as a unit test framework, where all of the data that is needed is set up at the beginning of the test and nothing relies on a system setting or a particular user having a permission. It does this quite well, and can test core functionality and individual modules easily. If you are testing an individual module you have written, using simpletest is, well, simple.
Unfortunately most websites use a number of modules and are configured to work together in a very specific way. Simpletest doesn't cope with this very well.
There are ways to get around this:
One option is to write a PHP script which works as a big setup step for your test. This can create users and set settings and permissions. It can be difficult to write and maintain, though, and can cause the tests to take a long time to run.
Another option is for the site testing (which is different from unit testing) to be done in a tool other than simpletest. I have had some success with Selenium. The downside to this is that you need to find a way to have clean data, which can be tricky; copying a database works but doesn't scale.
I've been pointed to this blog post as an answer to the question: http://www.trellon.com/content/blog/forcing-simpletest-use-live-database
You can also use a site deployment module and enable only that at the very beginning of your test (in your setUp() function).
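A rough sketch of that pattern (the deployment module name is a placeholder, and getInfo() and real tests are omitted):

```php
<?php
// Sketch for Drupal SimpleTest: enable a site deployment module in setUp()
// so the clean test install is configured like the real site.
// 'mysite_deploy' is a placeholder for your deployment module.
class MySiteDeployedTestCase extends DrupalWebTestCase {
  public function setUp() {
    // parent::setUp() installs a clean Drupal plus the modules passed in;
    // the deployment module's install/enable hooks then recreate the
    // roles, permissions and settings your site relies on.
    parent::setUp('mysite_deploy');
  }

  public function testFrontPageLoads() {
    // Minimal smoke test against the deployed configuration.
    $this->drupalGet('');
    $this->assertResponse(200);
  }
}
```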
