Run Airflow Webserver and Scheduler on different servers

I was wondering whether Airflow's scheduler and webserver daemons could be launched on different server instances?
And if it's possible, why not use a serverless architecture for the Flask webserver?
There are a lot of resources about multi-node clusters for workers, but I found nothing about splitting the scheduler and webserver.
Has anyone already done this? And what difficulties might I face?

I would say the minimum requirement would be that both instances have:
Read(-write) access to the same AIRFLOW_HOME directory (for accessing DAG scripts and the shared config file)
Access to the same database backend (for accessing shared metadata)
Exactly the same Airflow version (to prevent any potential incompatibilities)
Then just try it out and report back (I am really curious ;) ).
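To make the shared-config requirement concrete, here is a minimal sketch of the kind of airflow.cfg both machines would point at through a common AIRFLOW_HOME. All hostnames and credentials are placeholders, and note that in newer Airflow versions the connection string lives under a [database] section rather than [core]:

```python
import configparser

# Sketch of a shared airflow.cfg; the DB host and credentials below are
# placeholders, not real endpoints. Both daemons must read this same file.
cfg = configparser.ConfigParser()
cfg["core"] = {
    "dags_folder": "/shared/airflow/dags",  # same NFS mount on both hosts
    "executor": "LocalExecutor",
    # both the scheduler and the webserver must resolve the SAME metadata DB
    "sql_alchemy_conn": "postgresql+psycopg2://airflow:secret@db-host:5432/airflow",
}

with open("airflow.cfg", "w") as f:
    cfg.write(f)

# Sanity check: re-read the file and confirm both processes see one backend
check = configparser.ConfigParser()
check.read("airflow.cfg")
print(check["core"]["sql_alchemy_conn"].startswith("postgresql"))  # prints True
```

The point of the re-read at the end is that each daemon parses this file independently at startup, so any drift between the two hosts' copies shows up as the two services disagreeing about state.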


Set up Airflow production environment

I'm a newbie at using Airflow. I went through many Airflow tutorials, and I can say that all are about development environments using a docker-compose file or files. I'm facing a problem at work setting up a production environment properly. My goal is to have a cluster composed of 3 EC2 virtual machines. Can anyone share best practices for installing Airflow on that cluster?
I went through many tutorials on the internet.
Airflow has 4 main components:
Webserver: a stateless service that exposes Airflow's UI and REST API
Scheduler: a stateless service that parses the DAGs and schedules their tasks; it is the main component
Worker: a stateless service that executes the tasks
Metadata database: the database where Airflow stores its state; it mediates communication between the three other components
And Airflow has 4 main executors:
LocalExecutor: the scheduler runs tasks itself by spawning a process per task; it works on a single host -> not suitable for your needs
CeleryExecutor: the most widely used executor; you can run one or more schedulers (for HA) and a group of Celery workers to execute the tasks, and you can scale it across different nodes
DaskExecutor: similar to CeleryExecutor but uses Dask instead of Celery; it is not widely used, and there are few resources about it
KubernetesExecutor: runs each task in a Kubernetes pod, and since it's based on Kubernetes it's very scalable, but it has some drawbacks.
For your use case, I recommend the CeleryExecutor.
If you can use EKS instead of EC2, you can use the Helm chart to install and configure the cluster. If not, you have other options:
run the services directly on the host:
pip install apache-airflow[celery]
# run the webserver
airflow webserver
# run the scheduler
airflow scheduler
# run the worker
airflow celery worker
You can decide how many schedulers, workers, and webservers you want to run, and distribute them across the 3 nodes, e.g. node1 (1 scheduler, 1 webserver, 1 worker), node2 (1 scheduler, 2 workers), node3 (1 webserver, 2 workers). You also need a database; you can use Postgres on AWS RDS, or create it on one of the nodes (not recommended).
using Docker: same as the first solution, but you run containers instead of running the services directly on the host
using Docker Swarm: you can connect the 3 nodes into a swarm cluster and manage the configuration from one of the nodes; this gives you some features that the first 2 solutions don't provide, and it's similar to K8S. (doc)
For all 3 solutions, you need to create an airflow.cfg file containing the configuration and the DB credentials, and set the executor option to CeleryExecutor.
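As a sketch of what that shared airflow.cfg might contain for the CeleryExecutor setup, here is an illustrative fragment. The Redis and Postgres endpoints are placeholders you would replace with your own (e.g. an RDS instance and a broker reachable from all 3 nodes):

```python
import configparser

# Illustrative airflow.cfg shared by all three nodes. Every hostname and
# credential here is a placeholder, not a real endpoint.
cfg = configparser.ConfigParser()
cfg["core"] = {
    "executor": "CeleryExecutor",
    # metadata DB shared by schedulers, webservers, and workers
    "sql_alchemy_conn": "postgresql+psycopg2://airflow:secret@rds-host:5432/airflow",
}
cfg["celery"] = {
    # message queue the schedulers push task messages onto
    "broker_url": "redis://redis-host:6379/0",
    # where workers record task results
    "result_backend": "db+postgresql://airflow:secret@rds-host:5432/airflow",
}

with open("airflow.cfg", "w") as f:
    cfg.write(f)

check = configparser.ConfigParser()
check.read("airflow.cfg")
print(check["core"]["executor"])  # prints CeleryExecutor
```

Distributing the file (or the matching environment variables) identically to every node is what lets you place schedulers, webservers, and workers on whichever host you like.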

Amazon ECS Application level monitoring and instance autohealing without ELB

I'm deploying an app to Amazon ECS and need some advice on application-level monitoring (periodic HTTP 200 and/or body-match checks). Usually I place the app behind an ELB, and I can be sure the ELB will take action if it sees too many HTTP errors.
However, this time it's a very low-budget project, and the cost of the ELB should be avoided (also consider this is going to run on only one instance, as the userbase is very limited).
What strategies could I adopt to ensure the application is alive (kill the instance and restart it in case of too many app errors)? Regarding the instance, I know about AWS auto-healing, but that operates at the infrastructure level.
Obviously, one of the problems is that without an ELB I must bind the DNS to an EIP, so reassigning it is crucial.
And obviously the solution should not involve any other EC2 instance; external services are acceptable, but keeping it all inside AWS would be great.
Thanks a lot
Monitoring ECS is important for keeping your site healthy. If you are still concerned about issues with your deployment on AWS, I suggest using AWS's auto-scaling feature.
You can scale ECS up when needed and release capacity when it is not required.
Nagios is another open-source monitoring tool you can leverage; it is easy to install and configure.
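The periodic "HTTP 200 plus body match" check the question describes can also be a tiny cron-driven script. Here is a sketch of just the decision logic; the actual fetch (urllib.request) and the EIP-reassignment / restart step (hypothetically boto3's ec2 client) are left as comments because they need a live endpoint and AWS credentials:

```python
import re
from typing import List, Optional

# Health-check predicate: HTTP 200, plus an optional body-content match.
def is_healthy(status: int, body: str, pattern: Optional[str] = None) -> bool:
    if status != 200:
        return False
    return pattern is None or re.search(pattern, body) is not None

FAILURE_THRESHOLD = 3  # consecutive failures before taking action

def should_restart(results: List[bool]) -> bool:
    """True once the last FAILURE_THRESHOLD checks have all failed."""
    recent = results[-FAILURE_THRESHOLD:]
    return len(recent) == FAILURE_THRESHOLD and not any(recent)

# In a cron job you would fetch the page with urllib.request, record
# is_healthy(...) for each run, and when should_restart(...) fires,
# reboot/replace the instance and re-associate the EIP via boto3
# (e.g. ec2.reboot_instances / ec2.associate_address -- illustrative calls).
print(should_restart([True, False, False, False]))  # prints True
```

The threshold guards against a single transient error triggering a restart, which mirrors what an ELB health check's "unhealthy threshold" would do for you.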

Flask SQLAlchemy Database with AWS Elastic Beanstalk - waste of time?

I have successfully deployed a Flask application to AWS Elastic Beanstalk. The application uses an SQLAlchemy database, and I am using Flask-Security to handle login/registration, etc. I am using Flask-Migrate to handle database migrations.
The problem here is that whenever I use git aws.push it will push my local database to AWS and overwrite the live one. I guess what I'd like to do is only ever "pull" the live one from AWS EB, and only push in rare circumstances.
Will I be able to access the SQLAlchemy database which I have pushed to AWS? Or, is this not possible? Perhaps there is some combination of .gitignore and .elasticbeanstalk settings which could work?
I am using SQLite.
Yes: your database should not be in version control; it should live on persistent storage (most likely Elastic Block Store (EBS)), and you should handle schema changes (migrations) using something like Flask-Migrate.
The AWS help article on EBS should get you started, but at a high level, what you are going to do is:
Create an EBS volume
Attach the volume to a running instance
Mount the volume on the instance
Expose the volume to other instances using a Network File System (NFS)
Ensure that when new EBS instances launch, they mount the NFS
Alternatively, you can:
Wait until Elastic File System (EFS) is out of preview (or request access) and mount all of your EB-started instances on the EFS once EB supports EFS.
Switch to the Relational Database Service (RDS) (or run your own database server on EC2) and run an instance of (PostgreSQL|MySQL|Whatever you choose) locally for testing.
The key is hosting your database outside of your Elastic Beanstalk environment. If you don't, then as load increases, different instances of your Flask app will write to their own local DBs, and there won't be a "master" database containing all the commits.
The easiest solution is using the AWS Relational Database Service (RDS) to host your DB as an outside service. A good tutorial that walks through this exact scenario:
Deploying a Flask Application on AWS using Elastic Beanstalk and RDS
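One simple way to keep a local SQLite database for development while the deployed app talks to RDS is to drive the SQLAlchemy URI from an environment variable. A sketch, assuming the common DATABASE_URL convention (this variable is not something Elastic Beanstalk sets for you automatically; the URL below is a placeholder):

```python
import os

# Fall back to a local SQLite file for development, but use DATABASE_URL
# (e.g. an RDS Postgres URL set in the EB environment) when it exists.
def database_uri() -> str:
    return os.environ.get("DATABASE_URL", "sqlite:///dev.db")

class Config:
    SQLALCHEMY_DATABASE_URI = database_uri()

# Simulate the deployed environment (placeholder credentials/host):
os.environ["DATABASE_URL"] = "postgresql://user:pw@rds-host:5432/appdb"
print(database_uri())  # prints postgresql://user:pw@rds-host:5432/appdb
```

With this in place, the SQLite file can stay in .gitignore: git aws.push never ships it, and the live app never depends on it.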
SQLAlchemy/Flask/AWS is definitely not a waste of time! Good luck.

Hosting a WordPress blog on AWS

I have hosted a WordPress blog on AWS using a t1.micro EC2 instance (Ubuntu).
I am not an expert in Linux administration; however, after going through a few tutorials, I managed to get WordPress running successfully.
I noticed a warning on the AWS console that "in case your EC2 instance terminates, you will lose your data, including WordPress files and data stored by the MySQL service."
Does that mean I should use the S3 service for storing data to avoid any accidental data loss? Or will my data remain safe in an EBS volume even if my EC2 instance terminates?
By default, the root volume of an EC2 instance is deleted when the instance is terminated. The instance can be terminated automatically only if it's running as a spot instance; otherwise it is only terminated if you do it yourself.
Now with that in mind, EBS volumes are not failure proof. They have a small chance of failing. To recover from this, you should either create regular snapshots of your EBS volume, or back up the contents of your instance to s3 or other storage service.
You can set up a snapshot lifecycle policy to create scheduled volume snapshots.
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/snapshot-lifecycle.html
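The lifecycle policy linked above handles retention for you; purely as an illustration of the logic such a daily-snapshot policy applies (keep the newest N, prune the rest), here is a small sketch with made-up dates:

```python
from datetime import date, timedelta

# Retention logic of a "keep the last N daily snapshots" policy:
# everything older than the newest `keep` snapshots is a deletion candidate.
def snapshots_to_delete(snapshot_dates, keep=7):
    ordered = sorted(snapshot_dates, reverse=True)  # newest first
    return ordered[keep:]

# Ten daily snapshots ending on an arbitrary example date:
today = date(2024, 1, 31)
snaps = [today - timedelta(days=d) for d in range(10)]
print(len(snapshots_to_delete(snaps, keep=7)))  # prints 3
```

Whether you keep 7 dailies or a mixed daily/weekly schedule, the trade-off is the same: more snapshots mean more restore points but higher snapshot storage cost.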

Security groups and UDP on Heroku

Has anyone experienced running multiple collaborating applications on Heroku? For example, an admin application to manage another application; or a stats server observing another application?
On Amazons' EC2 platform you can use security groups to restrict access to servers, creating a virtual network between your application or server instances. Is there any such way to do this on Heroku? If so, can you open UDP as well as TCP connections?
Thanks
Robbie
The comment from #elithrar is correct. To communicate between applications you either need to define an API or use shared resources. For example, you can have 2 applications connect to the same database by manually copying and pasting the DATABASE_URL from one app to the other. This has the downside that, should Heroku need to roll credentials (very rare), your manually copied configuration will break.
The same pattern can be used with any add-ons, such as https://addons.heroku.com/redistogo or https://addons.heroku.com/iron_mq to share a message bus or queue between two applications.
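The shared-resource pattern boils down to both apps reading the same connection URL from their config vars and parsing out the pieces their driver needs. A sketch using only the standard library (the URL below is a placeholder, not real credentials):

```python
import os
from urllib.parse import urlparse

# Both Heroku apps would have the same DATABASE_URL config var; here we
# simulate it with a placeholder value.
os.environ["DATABASE_URL"] = "postgres://user:pw@ec2-host.compute-1.amazonaws.com:5432/db1"

def db_settings():
    """Split DATABASE_URL into the parts a DB driver expects."""
    url = urlparse(os.environ["DATABASE_URL"])
    return {
        "host": url.hostname,
        "port": url.port,
        "user": url.username,
        "password": url.password,
        "dbname": url.path.lstrip("/"),
    }

print(db_settings()["dbname"])  # prints db1
```

Because both apps derive their settings from the config var at boot, rolling credentials only requires updating the var on each app rather than redeploying code.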
