Airflow in ECS cluster - airflow

I have previously used Airflow running on an Ubuntu VM. I was able to log in to the VM via WinSCP and PuTTY to run commands and edit Airflow-related files as required.
But this is the first time I have come across Airflow running on an AWS ECS cluster. I am new to ECS, so I am trying to find the best way to:
run commands like "airflow dbinit", stop/start the webserver and scheduler, etc.
install new Python packages like "pip install "
view and edit Airflow files in the ECS cluster
I was reading about the AWS CLI and the ECS CLI. Could they be helpful? Or is there any other way that lets me do the above-mentioned actions?
Thanks in Advance.
Kind Regards
P.

There are many articles that describe how to run Airflow on ECS:
https://tech.gc.com/apache-airflow-on-aws-ecs/
https://towardsdatascience.com/building-an-etl-pipeline-with-airflow-and-ecs-4f68b7aa3b5b
https://aws.amazon.com/blogs/containers/running-airflow-on-aws-fargate/
and many more
[Note: Fargate allows you to run containers via ECS without needing EC2 instances. More here if you want additional background on what Fargate is.]
Also, the AWS CLI is a generic CLI that maps to all AWS APIs (mostly 1:1). You may consider it for your deployment (but it should not be your starting point).
The ECS CLI is a more abstracted CLI that exposes higher-level constructs and workflows that are specific to ECS. Note that the ECS CLI has been superseded by AWS Copilot, which is an opinionated CLI for deploying workload patterns on ECS. It's perhaps too opinionated to be able to deploy Airflow.
My suggestion is to go through the blogs above and get inspired.
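For the "run commands" part of the question, one option with the AWS CLI is ECS Exec (`aws ecs execute-command`). Below is a minimal sketch that only builds the CLI invocation. It assumes ECS Exec is enabled on the task and the Session Manager plugin is installed locally; the cluster name, task ID, container name and the Airflow 2.x style command are hypothetical placeholders, not values from the question.

```python
import shlex


def ecs_exec_command(cluster, task, container, command):
    """Build an `aws ecs execute-command` invocation as an argument list."""
    return [
        "aws", "ecs", "execute-command",
        "--cluster", cluster,
        "--task", task,
        "--container", container,
        "--interactive",
        "--command", command,
    ]


# Hypothetical names: replace them with the values from your own cluster.
args = ecs_exec_command(
    cluster="airflow-cluster",
    task="0123456789abcdef",
    container="scheduler",
    command="airflow db init",  # assumption: Airflow 2.x CLI syntax
)

# Print a copy-pasteable shell command, or pass `args` to subprocess.run().
print(shlex.join(args))
```

Keep in mind that containers are replaceable: a package installed with pip inside a running task is gone once the task is restarted, so new Python packages are better added to the image itself.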

Related

Docker-compose file of Airflow with DaskExecutor

Can someone provide a docker-compose YAML file for Airflow with the DaskExecutor, as mentioned in the title? I need it for a project.
I am trying to execute my tasks in parallel on each core of the workers, as that gives the tasks a performance boost.
To achieve this I want to execute my Airflow tasks directly on the Dask cluster. My project requires Airflow to run on Docker, but I couldn't find any docker-compose.yaml file for Airflow with the DaskExecutor.
Dask generally has a scheduler and some workers in its cluster.
Apart from this, I've tried to achieve this task parallelism with the airflow-provider-ray library from the Astronomer Registry. I followed this documentation to set it up in Docker, but I am facing OSError: Connection timeout. Here I am running Airflow in Docker and the Ray cluster in my local Python environment.
Secondly, I've tried the same with a Dask cluster. In this setup, Airflow runs in Docker with the Celery executor, and in another Docker setup there is a Dask scheduler, two workers, and a notebook. I am able to connect these, but I keep getting the error ModuleNotFoundError: No module named 'unusual_prefix_2774d32fcb40c2ba2f509980b973518ede2ad0c3_dask_client'.
The solution to any of these problems will be appreciated.

Remote execution on a Single node with Multiple GPUs

I am looking into documentation for running Hydra on a single node remotely. I am looking for a way to take code that lives on my local machine and run it on a GCP instance.
Any pointers?
It sounds like you are looking for a Hydra Launcher that supports GCP.
For now, Hydra does not support this. We do have a Ray Launcher that launches to AWS and could be further extended to launch on GCP. Feel free to subscribe to this issue.

Connect dbt to Airflow using EKS

We have deployed Airflow with Helm in AWS EKS and want to trigger dbt models from it.
A few questions:
1. What would be the ideal way to deploy dbt? I am thinking about deploying another container for dbt only or installing dbt in the same container running Airflow via pip or brew.
2. If the ideal way to run dbt is in its own container, how do I connect airflow to dbt?
Please feel free to add any relevant information!
I think you should consider switching to the official chart that the Apache Airflow community released recently: https://airflow.apache.org/docs/helm-chart/stable/index.html - it is prepared and maintained by the same community that builds Airflow.
I think one of the best descriptions of how to integrate dbt can be found in this Astronomer blog post: https://www.astronomer.io/blog/airflow-dbt-1
Summarising it: if you do not want to use dbt Cloud, you can install dbt as a pip package and either run it via a Bash script or use dedicated dbt operators. If you already run Airflow from an image, invoking dbt from a separate image, while technically possible, is a bit challenging and likely not worth the hassle.
You should simply extend the Airflow image and add dbt as a pip package. You can learn how to extend or customize the Airflow image here: https://airflow.apache.org/docs/docker-stack/build.html
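With dbt installed in the extended image, the "run it via a Bash script" option could look roughly like the sketch below. It only builds the dbt command lines; the paths and the "staging" selector are hypothetical, and `--select` assumes a reasonably recent dbt version (older releases used `--models` for `dbt run`).

```python
import shlex


def dbt_command(subcommand, project_dir, profiles_dir, select=None):
    """Return a shell-safe dbt command line for use in a Bash task."""
    args = [
        "dbt", subcommand,
        "--project-dir", project_dir,
        "--profiles-dir", profiles_dir,
    ]
    if select:
        # Restrict the run to a subset of models (hypothetical selector).
        args += ["--select", select]
    return shlex.join(args)


# Hypothetical location of the dbt project inside the Airflow image.
DBT_DIR = "/opt/airflow/dbt"

DBT_RUN = dbt_command("run", DBT_DIR, DBT_DIR, select="staging")
DBT_TEST = dbt_command("test", DBT_DIR, DBT_DIR, select="staging")

# In a DAG file these strings would be handed to Airflow's BashOperator, e.g.
#   BashOperator(task_id="dbt_run", bash_command=DBT_RUN)
print(DBT_RUN)
print(DBT_TEST)
```

Keeping the command construction in a plain function makes it easy to unit test without having Airflow or dbt installed.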
Small follow-up. Not sure if you have seen the talk from last week's Airflow Summit but I highly recommend it: https://airflowsummit.org/sessions/2021/building-a-robust-data-pipeline-with-the-dag-stack/ - it might give you a bit more answers :)

Best way to distribute code to airflow webserver / scheduler + workers and workflow

What do people find is the best way to distribute code (DAGs) to the Airflow webserver / scheduler + workers? I am trying to run Celery on a large cluster of workers, such that any manual updates are impractical.
I am deploying Airflow on Docker and using s3fs right now, and it is crashing on me constantly and creating weird core.### files. I am exploring other solutions (i.e. StorageMadeEasy, DropBox, EFS, a cron job to update from git...) but would love a little feedback as I explore solutions.
Also, how do people typically make updates to DAGs and distribute that code? If one uses a shared volume like s3fs, do you restart the scheduler every time you update a DAG? Is editing the code in place on something like DropBox asking for trouble? Any best practices on how to update DAGs and distribute the code would be much appreciated.
I can't really tell you what the "best" way of doing it is but I can tell you what I've done when I needed to distribute the workload onto another machine.
I simply set up an NFS share on the Airflow master for both the DAGS and the PLUGINS folders and mounted this share on the worker. I had an issue once or twice where the NFS mount point would break for some reason, but after re-mounting, the jobs continued.
To distribute the DAG and PLUGIN code to the Airflow cluster I just deploy it to the master (I do this with a bash script on my local machine which simply SCPs the folders up from my local git branch) and NFS handles the replication to the worker. I always restart everything after a deploy, and I don't deploy while a job is running.
A better way to deploy would be to have Git on the Airflow master server, check out a branch from a Git repository (test or master, depending on the Airflow server?) and then replace the dags and plugins with the ones in the Git repository. I'm experimenting with doing deployments like this at the moment with Ansible.
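The Git-based deploy described above could be sketched roughly as follows. This is not the Ansible setup from the answer, just a minimal Python illustration; the function names, the remote name "origin" and the folder layout (a checkout containing "dags" and "plugins" folders) are assumptions.

```python
import shutil
import subprocess
from pathlib import Path


def update_checkout(repo_dir, branch):
    """Bring an existing clone in line with origin/<branch>."""
    repo = str(repo_dir)
    subprocess.run(["git", "-C", repo, "fetch", "origin", branch], check=True)
    subprocess.run(
        ["git", "-C", repo, "reset", "--hard", f"origin/{branch}"], check=True
    )


def replace_folder(src, dst):
    """Replace dst with a fresh copy of src, dropping files that no longer exist."""
    src, dst = Path(src), Path(dst)
    if dst.exists():
        shutil.rmtree(dst)
    shutil.copytree(src, dst)


def deploy(repo_dir, airflow_home, branch="master"):
    """Update the checkout, then replace the dags and plugins folders."""
    update_checkout(repo_dir, branch)
    for name in ("dags", "plugins"):
        replace_folder(Path(repo_dir) / name, Path(airflow_home) / name)
```

A call such as deploy("/srv/airflow-repo", "/opt/airflow", branch="test") (both paths hypothetical) would be run on the master, after which the NFS share makes the new files visible to the workers. As noted above, it is safer not to run this while a job is in progress.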

How to scale Docker containers in production

So I recently discovered this awesome tool, and it says
Docker is an open-source project to easily create lightweight,
portable, self-sufficient containers from any application. The same
container that a developer builds and tests on a laptop can run at
scale, in production, on VMs, bare metal, OpenStack clusters, public
clouds and more.
Let's say I have a Docker image which runs Nginx and a website that connects to an external database. How do I scale the container in production?
Update: 2019-03-11
First of all thanks for those who have upvoted this answer over the years.
Please be aware that this question was asked in August 2013, when Docker was still a very new technology. Since then: Kubernetes was launched in June 2014, Docker swarm was integrated into the Docker engine in Feb 2015, Amazon launched its container solution, ECS, in April 2015 and Google launched GKE in August 2015. It's fair to say the production container landscape has changed substantially.
The short answer is that you'd have to write your own logic to do this.
I would expect this kind of feature to emerge from the following projects, built on top of docker, and designed to support applications in production:
flynn
deis
coreos
Mesos
Update 1
Another related project I recently discovered:
maestro
Update 2
The latest release of OpenStack contains support for managing Docker containers:
Docker Openstack
Paas zone within OpenStack
Update 3
System for managing Docker instances
Shipyard
And a presentation on how to use tools like Packer, Docker and Serf to deliver an immutable server infrastructure pattern
FutureOps with Immutable Infrastructure
Slides
Update 4
A neat article on how to wire together docker containers using serf:
Decentralizing Docker: How to use serf with Docker
Update 5
Run Docker on Mesos using the Marathon framework
Mesosphere Docker Developer Tutorial
Update 6
Run Docker on Tsuru as it supports docker-cluster and segregated scheduler deploy
http://blog.tsuru.io/2014/04/04/running-tsuru-in-production-scaling-and-segregating-docker-containers/
Update 7
Docker-based environments orchestration
maestro-ng
Update 8
decking.io
Update 9
Google kubernetes
Update 10
Red Hat has refactored its OpenShift PaaS to integrate Docker
Project Atomic
Geard
Update 11
A Docker NodeJS lib wrapping the Docker command line and managing it from a json file.
docker-cmd
Update 12
Amazon's new container service enables scaling in the cluster.
Update 13
Strictly speaking, Flocker does not "scale" applications, but it is designed to fulfil a related function: making stateful containers (running database services?) portable across multiple Docker hosts:
https://clusterhq.com/
Update 14
A project to create portable templates that describe Docker applications:
http://panamax.io/
Update 15
The Docker project is now addressing orchestration natively (See announcement)
Docker machine
Docker swarm
Docker compose
Update 16
Spotify Helios
See also:
https://blog.docker.com/tag/helios/
Update 17
The Openstack project now has a new "container as a service" project called Magnum:
https://wiki.openstack.org/wiki/Magnum
Shows a lot of promise, enables the easy setup of Docker orchestration frameworks like Kubernetes and Docker swarm.
Update 18
Rancher is a project that is maturing rapidly
http://rancher.com/
Nice UI and strong focus on hybrid Docker infrastructures
Update 19
The Lattice project is an offshoot of Cloud Foundry for managing container clusters.
Update 20
Docker recently bought Tutum:
https://www.docker.com/tutum
Update 21
Package manager for applications deployed on Kubernetes.
http://helm.sh/
Update 22
Vamp is an open source and self-hosted platform for managing (micro)service oriented architectures that rely on container technology.
http://vamp.io/
Update 23
A Distributed, Highly Available, Datacenter-Aware Scheduler
https://www.nomadproject.io/
From the guys that gave us Vagrant and other powerful tools.
Update 24
Container hosting solution for AWS, open source and based on Kubernetes
https://supergiant.io/
Update 25
Apache Mesos-based container hosting, located in Germany
https://sloppy.io/features/#features
And Docker Inc. also provides a container hosting service called Docker Cloud
https://cloud.docker.com/
Update 26
Jelastic is a hosted PaaS that scales containers automatically.
Deis automates scaling of Docker containers (among other things).
Deis (pronounced DAY-iss) is an open source PaaS that makes it easy to deploy and manage applications on your own servers. Deis builds upon Docker and CoreOS to provide a lightweight PaaS with a Heroku-inspired workflow.
Here is the developer workflow:
deis create myapp # create a new deis app called "myapp"
git push deis master # built with a buildpack or dockerfile
deis scale web=16 worker=4 # scale up docker containers
Deis automatically deploys your Docker containers across a CoreOS cluster and configures the Nginx routers to route requests to healthy Docker containers. If a host dies, containers are automatically restarted on another host in seconds. Just browse to the proxy URL or use deis open to hit your app.
Some other useful commands:
deis config:set DATABASE_URL= # attach to a database w/ an envvar
deis run make test # run ephemeral containers for one-off tasks
deis logs # get aggregated logs for troubleshooting
deis rollback v23 # rollback to a prior release
To see this in action, check out the terminal video at http://deis.io/overview/. You can also learn about Deis concepts or jump right into deploying your own private PaaS.
You can try Tsuru. Tsuru is an open source PaaS inspired by Heroku, and it already has some products running in production at Globo.com (the internet arm of the biggest broadcast television company in Brazil).
It manages the entire flow of an application, from container creation through deploy and routing (with hipache), with many nice features such as docker cluster, scaling of units, segregated deploy, etc.
Take a look at our documentation below:
http://docs.tsuru.io/
Here is our post covering our environment:
http://blog.tsuru.io/2014/04/04/running-tsuru-in-production-scaling-and-segregating-docker-containers/
Have a look at Rancher.com - it can manage multiple Docker hosts and much more.
A sensible approach to scaling Docker could be:
Each service will be a docker container
Inter-container service discovery managed through links (new feature from docker 0.6.5)
Containers will be deployed through Dokku
Applications will be managed through Shipyard, which in turn uses hipache
Another docker open sourced project from Yandex:
cocaine
The OpenShift guys also created a project. You can find more information here, try the test container, and find detailed info here.
The only problem is that the solution is Red Hat-centric for now :)
While we're big fans of Deis (deis.io) and are actively deploying to it, there are other Heroku-like, PaaS-style deployment solutions out there, including:
Longshoreman from the Wayfinder folks:
https://github.com/longshoreman/longshoreman
Decker from the CloudCredo folks, using CloudFoundry:
http://www.cloudcredo.com/decker-docker-cloud-foundry/
As for straight-up orchestration, New Relic's open source Centurion project seems quite promising:
https://github.com/newrelic/centurion
Take a look also at etcd and Consul.
Panamax: Docker Management for Humans. panamax.io
Fig: Fast, isolated development environments using Docker. fig.sh
One option not mentioned in other posts is Helios. It is built by Spotify and does not try to do too much.
https://github.com/spotify/helios

Resources