Apache Airflow: install two instances on the same local machine

I have an Airflow instance on a local Ubuntu machine. This instance doesn't work very well, so I would like to install it again. The problem is that I can't delete the current instance, because it is used by other people, so I would like to create a new Airflow instance on the same machine and put various DAGs there.
How could I do it? I created a different virtual environment, but I don't know how to install a second Airflow server in that environment that runs in parallel with the current one.
Thank you!

Use a different port for the webserver.
Use a different AIRFLOW_HOME variable.
Use a different sql_alchemy_conn (to point to a different database).
Copy the deployment setup you use to start and stop your Airflow components.
Depending on your deployment, you might somehow record the process IDs of your running Airflow (so-called pid files) or have some other way to determine which processes are running. That is nothing Airflow-specific, though; it is specific to your deployment.
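As a rough illustration of the first three points, here is a minimal Python sketch that builds the environment for a second instance. It assumes Airflow's AIRFLOW__{SECTION}__{KEY} environment-variable overrides; the paths, the port 8081, and the helper name second_instance_env are made up for the example. On Airflow 2.3 and later the connection option lives in the [database] section, so the variable would be AIRFLOW__DATABASE__SQL_ALCHEMY_CONN instead.

```python
import os


def second_instance_env(home, port, db_uri, base_env=None):
    """Return a copy of the environment configured for a separate Airflow instance.

    Hypothetical helper: it only prepares variables, it does not start anything.
    """
    env = dict(os.environ if base_env is None else base_env)
    # Separate home directory: own airflow.cfg, logs and dags folder.
    env["AIRFLOW_HOME"] = home
    # Separate metadata database, so the two instances never share state.
    env["AIRFLOW__CORE__SQL_ALCHEMY_CONN"] = db_uri
    # Separate webserver port (the first instance usually sits on 8080).
    env["AIRFLOW__WEBSERVER__WEB_SERVER_PORT"] = str(port)
    return env


# Example values only; adjust to your machine.
env = second_instance_env(
    home="/home/me/airflow2",
    port=8081,
    db_uri="sqlite:////home/me/airflow2/airflow.db",
)
```

The resulting dictionary could then be passed as env= to subprocess.Popen when launching the webserver and scheduler from the new virtual environment, or the same three variables could simply be exported in the shell script that starts the second instance.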

Related

Best way to distribute code to airflow webserver / scheduler + workers and workflow

What do people find is the best way to distribute code (dags) to airflow webserver / scheduler + workers? I am trying to run celery on a large cluster of workers such that any manual updates are impractical.
I am deploying airflow on docker and using s3fs right now, and it is crashing on me constantly and creating weird core.### files. I am exploring other solutions (i.e. StorageMadeEasy, DropBox, EFS, a cron job to update from git...) but would love a little feedback as I explore solutions.
Also, how do people typically make updates to dags and distribute that code? If one uses a shared volume like s3fs, do you restart the scheduler every time you update a dag? Is editing the code in place on something like DropBox asking for trouble? Any best practices on how to update dags and distribute the code would be much appreciated.
I can't really tell you what the "best" way of doing it is but I can tell you what I've done when I needed to distribute the workload onto another machine.
I simply set up an NFS share on the airflow master for both the DAGS and the PLUGINS folders and mounted this share onto the worker. I had an issue once or twice where the NFS mount point would break for some reason, but after re-mounting the jobs continued.
To distribute the DAG and PLUGIN code to the Airflow cluster I just deploy it to the master (I do this with a bash script on my local machine which just SCPs the folders up from my local git branch) and NFS handles the replication to the worker. I always restart everything after a deploy, and I don't deploy while a job is running.
A better way to deploy would be to have GIT on the airflow master server which checks out a branch from a GIT repository (test or master depending on the airflow server?) and then replace the dags and plugins with the ones in the git repository. I'm experimenting with doing deployments like this at the moment with Ansible.
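A git-based deploy of that kind could look roughly like the Python sketch below. It is not the Ansible setup mentioned above, just an illustration: the repository layout (dags/ and plugins/ at the top of the checkout), the paths, and the function names are assumptions, and it relies on git and rsync being installed on the master.

```python
import subprocess


def deploy_commands(repo_dir, branch, airflow_home):
    """Build the commands that update a checkout and mirror it into AIRFLOW_HOME."""
    repo = repo_dir.rstrip("/")
    home = airflow_home.rstrip("/")
    commands = [
        ["git", "-C", repo, "fetch", "origin"],
        ["git", "-C", repo, "checkout", branch],
        # Discard local drift so the checkout matches the remote branch exactly.
        ["git", "-C", repo, "reset", "--hard", f"origin/{branch}"],
    ]
    for folder in ("dags", "plugins"):
        # --delete removes files that no longer exist in the repository.
        commands.append(
            ["rsync", "-a", "--delete", f"{repo}/{folder}/", f"{home}/{folder}/"]
        )
    return commands


def deploy(repo_dir, branch, airflow_home, run=subprocess.check_call):
    """Run each command in order, stopping at the first failure."""
    for command in deploy_commands(repo_dir, branch, airflow_home):
        run(command)
```

Keeping the command list separate from the runner makes it easy to print the commands for a dry run before pointing the script at a real server.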

Apache Airflow environment setup

Can a single installation of Apache Airflow be used to handle multiple environments? eg. Dev, QA1, QA2, and Production (if so please guide) or do I need to have a separate install for each? What would be the best design considering maintenance of all environments.
You can do whatever you want. If you want to keep a single Airflow installation to handle different environments, you could switch connections or Airflow variables according to the environment.
At the end of the day DAGs are written in Python so you're really flexible.
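For instance, a DAG file could pick its connection and schedule from a small per-environment table. The sketch below is plain Python with no Airflow import; the DEPLOY_ENV variable, the connection ids, and the schedules are all hypothetical names chosen for the example.

```python
import os

# Hypothetical per-environment settings; the names are illustrative only.
ENV_SETTINGS = {
    "dev": {"conn_id": "warehouse_dev", "schedule": None},
    "qa1": {"conn_id": "warehouse_qa1", "schedule": None},
    "qa2": {"conn_id": "warehouse_qa2", "schedule": None},
    "prod": {"conn_id": "warehouse_prod", "schedule": "@daily"},
}


def settings_for(env_name=None):
    """Look up settings for an environment, defaulting to DEPLOY_ENV or 'dev'."""
    name = (env_name or os.environ.get("DEPLOY_ENV", "dev")).lower()
    try:
        return ENV_SETTINGS[name]
    except KeyError:
        raise ValueError(f"unknown environment: {name!r}") from None
```

A DAG definition would then call settings_for() once at the top of the file and pass the returned conn_id to its operators, so the same code runs unchanged in every environment.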

How to migrate airflow variables between DEV and PROD environments?

We are using airflow to schedule our data pipelines, and as part of that we have also added a few connections and variables in the airflow admin.
Everything worked fine in DEV; now we want to set up the PROD environment. How do we migrate these values into the PROD environment?
You can list or export variables and connections through the command line: https://airflow.apache.org/cli.html
Relevant commands:
airflow variables -e variables.json
airflow connections --list
For variables, I generally keep JSON files in our code repo to store non-sensitive variables for the different environments. These can then be imported easily via the command line, and changes are tracked through git.
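One way to produce such a file per environment is sketched below in Python. The helper name, the variable keys, and the file name are invented for the example; the output is a flat JSON object of key/value pairs, which is the shape the variables import command reads (airflow variables -i FILE on 1.10, airflow variables import FILE on 2.x).

```python
import json
from pathlib import Path


def write_variables_file(common, overrides, path):
    """Merge shared values with per-environment overrides and write them as JSON."""
    merged = {**common, **overrides}  # environment-specific values win
    target = Path(path)
    # Sorted keys keep git diffs small and stable between runs.
    target.write_text(json.dumps(merged, indent=2, sort_keys=True) + "\n")
    return merged
```

Calling it with the shared defaults and, say, a small dictionary of PROD overrides yields a variables_prod.json that can be committed next to the DAG code.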
For connections, another option is to use environment variables instead of setting them up in the UI. You can set connection properties using AIRFLOW_CONN_{CONNECTION_NAME}, for example AIRFLOW_CONN_AWS_DEFAULT for the connection aws_default.
The value stored in the variable must be in URI format, e.g. postgres://user:password@localhost:5432/master or s3://accesskey:secretkey@S3
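A small Python sketch of building such a variable is shown below. The two helper names are made up; the one real-world detail worth noting is that special characters in the user name or password need percent-encoding, because the URI uses @, : and / as separators.

```python
from urllib.parse import quote


def conn_env_name(conn_id):
    """Environment variable name for a connection id, e.g. aws_default."""
    return "AIRFLOW_CONN_" + conn_id.upper()


def conn_uri(scheme, user, password, host, port=None, schema=""):
    """Build a connection URI, percent-encoding the credentials."""
    netloc = f"{quote(user, safe='')}:{quote(password, safe='')}@{host}"
    if port is not None:
        netloc += f":{port}"
    return f"{scheme}://{netloc}/{schema}"
```

For example, conn_uri("postgres", "user", "password", "localhost", 5432, "master") gives the first URI shown above, and conn_env_name("aws_default") gives AIRFLOW_CONN_AWS_DEFAULT.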

steps for launching 2nd instance of drupalvm

Am I overlooking something in DrupalVM documentation? To run multiple instances, what steps do people follow?
I’ve seen mentions that after you’ve launched DrupalVM once, it’s quick to launch another instance.
Various approaches have had the same results, including some of the helpers on http://docs.drupalvm.com/en/latest/other/management-tools as well as the vagrant plugin vagrant-cachier. With each, starting a new instance takes the same (very long) length of time….
First, do you really need to launch two identical machines at once? You can have multiple websites (vhosts) on one VM. That way you would save some computer resources (memory). Edit the hosts file on your (host) machine to match the web server settings where you defined your website.
But if you insist, it should be possible to copy the whole project dir, change the IP of one of those two machines (config.vm.network "private_network", ip: "192.168.something.something" in the Vagrantfile) and run them simultaneously.

How to use a virtual machine with automated tests?

I am attempting to set up automated tests for our applications using a virtual machine environment.
What I would like to have is something like the following scenario:
The build server is automatically triggered to start an automated test for the application
A "build" script is then run, which consists of:
Copy the application files and a test script to a location accessible by the VM
Start the VM
In the VM, a special application looks in the shared folder and starts the test script
The test script does its job; results are output to the shared folder
The test script ends
The special application then deletes the test script
The special application somehow has the VM manager close the VM and revert to the previous snapshot
When the VM has exited, process the results and send them to the build server.
I am using TeamCity if that matters.
For virtual machines, we use VirtualBox but we are open to any other if needed.
Are there any applications or suites that would manage this scenario?
If there are none, I will code it myself. It should be easy; the only part I am not sure about is the handling of the virtual machine.
What I need to be able to do is to have the VM close itself after the test and revert to a previous snapshot since I want it to be in a known state for the next test.
Any pointers?
I have a similar setup running, and I chose to use Vagrant as it's the same thing our developers were using for normalizing the development environment.
The initial state of the virtual machine was scripted using puppet, but we didn't run the deployment scripts from scratch on each test, only once a day.
You could use puppet/chef for everything, but for all other operations on the VM, we would use Fabric scripts, as they were used for the real deployment too, and somehow fitted how we worked better. In sum the script would look something like the following:
vagrant up # fire up the vm, and run the puppet provisioning tool
fab vm run_test # run tests on vm
fab local process_result # process results on local shared folder
vagrant destroy # destroy the vm
The advantage is that your developers can also use vagrant to mimic your production environment without having to take care of that themselves (i.e. changes to your database settings get synced to all your developers' VMs wherever they are), and the same scripts can be used in production too.
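If a build server drives these four steps, it helps to guarantee that the VM is torn down even when the tests fail. The Python sketch below wraps the same commands; the fab task names come from the script above, while the -f flag on vagrant destroy (skip the confirmation prompt) and the function name run_pipeline are additions for unattended use.

```python
import subprocess

# The first three steps of the script above, in order.
STEPS = [
    ["vagrant", "up"],
    ["fab", "vm", "run_test"],
    ["fab", "local", "process_result"],
]
# -f skips the interactive confirmation, which a build server cannot answer.
CLEANUP = ["vagrant", "destroy", "-f"]


def run_pipeline(run=subprocess.check_call):
    """Run the steps in order; always destroy the VM afterwards."""
    try:
        for command in STEPS:
            run(command)
    finally:
        run(CLEANUP)
```

Because a failing step raises, the script exits with an error and the build server can mark the build as failed, while the finally clause still leaves the machine clean for the next run.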
VirtualBox does have a COM API. I have no experience with it, but it may be possible to use that. One option would be to have TeamCity fire off a script to do this. I'd suggest starting with NAnt (supported natively by TeamCity) and possibly executing PowerShell if necessary.
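As an alternative to the COM API, a script fired by the build server could drive VirtualBox through its VBoxManage command-line tool. The Python sketch below only assembles the commands and is untested against a real VM; the VM and snapshot names are placeholders, and the two function names are invented for the example.

```python
def before_test_commands(vm_name, snapshot):
    """Commands that put the VM into a known state and boot it without a GUI."""
    return [
        ["VBoxManage", "snapshot", vm_name, "restore", snapshot],
        ["VBoxManage", "startvm", vm_name, "--type", "headless"],
    ]


def after_test_commands(vm_name, snapshot):
    """Commands that stop the VM and roll it back for the next run."""
    return [
        ["VBoxManage", "controlvm", vm_name, "poweroff"],
        ["VBoxManage", "snapshot", vm_name, "restore", snapshot],
    ]
```

Each command list can be handed to subprocess.check_call from the build script, with the test run and result collection happening between the two groups.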
Though I don't have any experience with either, I happen to have heard of a couple applications in this space recently:
http://www.infoq.com/news/2011/05/virtual_machine_test_harness
http://www.automatedqa.com/techpapers/testcomplete/automated-testing-in-virtual-labs/
