How to run DockerOperator when Airflow is already a docker container? - airflow

I currently have an Airflow instance running with docker-compose. In the future, I will be moving to a kubernetes cluster. So, Airflow will always be running in a docker container.
That being said, how do I run a DockerOperator when Airflow itself is inside a docker container?
It's a docker-in-docker inception problem that I don't fully understand how to mitigate.
Thanks!

You could map the Docker socket (/var/run/docker.sock) into the container and configure the Docker engine URL to connect to the external Docker engine. Docker is designed so that the client runs locally while the engine that actually runs your containers can live elsewhere.
Search for the term "Docker-in-Docker" for more background.
There is currently (as of Docker Provider 2.0.0) a bug that prevents this, but it is being addressed in https://github.com/apache/airflow/pull/16932 and the fix should be out in less than a week. Alternatively, you can downgrade to the previous version of the provider.
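As a minimal sketch of the socket-mapping approach: it assumes the host's /var/run/docker.sock is bind-mounted into the Airflow container at the same path (for example through a volumes entry in docker-compose.yml). The helper names, task id, image and command are illustrative and not taken from the answer above.

```python
# Assumption: the host's /var/run/docker.sock is bind-mounted into the Airflow
# container at the same path, so tasks start as sibling containers on the host.
DOCKER_SOCKET_URL = "unix://var/run/docker.sock"


def docker_task_kwargs(task_id, image, command):
    """Build keyword arguments for a DockerOperator that uses the host engine.

    Hypothetical helper, for illustration only.
    """
    return {
        "task_id": task_id,
        "image": image,
        "command": command,
        "docker_url": DOCKER_SOCKET_URL,  # engine reached through the mapped socket
    }


def make_task(dag=None):
    """Create the operator; the import is lazy so the sketch loads without Airflow."""
    from airflow.providers.docker.operators.docker import DockerOperator

    kwargs = docker_task_kwargs("say_hello", "alpine:3", "echo hello")
    return DockerOperator(dag=dag, **kwargs)


# Inspect the arguments without needing Airflow or a Docker engine.
print(docker_task_kwargs("say_hello", "alpine:3", "echo hello"))
```

Containers started this way run next to the Airflow container on the host engine rather than inside it, so any paths used in volume mounts refer to the host filesystem.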

Related

Airflow in ECS cluster

I have previously used Airflow running on an Ubuntu VM. I was able to log in to the VM via WinSCP and PuTTY to run commands and edit Airflow-related files as required.
But this is the first time I have come across Airflow running on an AWS ECS cluster. I am new to ECS, so I am trying to find the best way to:
run commands like "airflow dbinit", stop/start web-server and scheduler etc...
Install new python packages like "pip install "
View and edit Airflow files in ECS cluster
I was reading about the AWS CLI and the ECS CLI. Could they be helpful? Or is there a better way to do the actions mentioned above?
Thanks in Advance.
Kind Regards
P.
There are many articles that describe how to run Airflow on ECS:
https://tech.gc.com/apache-airflow-on-aws-ecs/
https://towardsdatascience.com/building-an-etl-pipeline-with-airflow-and-ecs-4f68b7aa3b5b
https://aws.amazon.com/blogs/containers/running-airflow-on-aws-fargate/
and many more.
[Note: Fargate allows you to run containers via ECS without needing EC2 instances.]
Also, the AWS CLI is a generic CLI that maps all AWS APIs (mostly 1:1). You may consider it for your deployment, but it should not be your starting point.
The ECS CLI is a more abstracted CLI that exposes higher-level constructs and workflows specific to ECS. Note that the ECS CLI has been superseded by AWS Copilot, an opinionated CLI for deploying workload patterns on ECS. It is perhaps too opinionated to be able to deploy Airflow.
My suggestion is to go through the blogs above and get inspired.

Using DockerOperator with CeleryExecutor in airflow DAG

At this time I use the LocalExecutor with Airflow. My DAG uses Docker images with the DockerOperator that ships with Airflow. For this, the Docker images must be present on the PC. If I want to use a distributed executor like CeleryExecutor or KubernetesExecutor, must the Docker images be present on all the machines that are part of the Celery or Kubernetes cluster?
Regards
Oli
That is correct. Since Airflow workers run tasks locally, you need to have the Docker images, or any other resources, available locally on each worker. You could set up a local Docker registry to serve the images, which saves you the effort of maintaining them manually on every machine.
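A minimal sketch of the registry approach, assuming a hypothetical private registry at registry.example.internal:5000 that every worker can reach. The force_pull argument asks DockerOperator to pull the image on each run, so workers do not need it preloaded. The helper and image names are illustrative.

```python
REGISTRY = "registry.example.internal:5000"  # hypothetical registry address


def registry_image(repository, tag="latest", registry=REGISTRY):
    """Return a fully qualified image reference that points at the registry."""
    return f"{registry}/{repository}:{tag}"


def worker_task_kwargs(task_id, repository, tag="latest"):
    """Keyword arguments for a DockerOperator whose image comes from the registry."""
    return {
        "task_id": task_id,
        "image": registry_image(repository, tag),
        "force_pull": True,  # pull on every run instead of relying on a local copy
    }


print(worker_task_kwargs("nightly_etl", "etl/job", "1.0"))
```

Pinning an explicit tag rather than relying on latest keeps all workers on the same image version.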

Docker, and small production server infrastructure advice

I'm figuring out how to set up my production server the best way, but I'm a little stuck on how to do it correctly:
Currently, all my web applications are dockerized. I have:
One nginx front container that routes requests to several backend containers:
One Symfony App
Two WordPress blogs
One NodeJS App
One MySql container for DB storage
One MongoDB container too
All of this infrastructure is started using docker-compose.
This works fine, but it feels too "monolithic" to me:
I cannot stop one container without restarting all the others.
I cannot add other web applications without restarting everything
I have no way to restart container automatically after a crash...
This is the first time I'm doing this. Do you know of any best practices or software that could help me improve my production server?
Thanks a lot !
I cannot stop one container without restarting all the others.
What prevents you from using the docker stop command instead of the docker-compose stop command when you want to stop only one container?
I cannot add other web applications without restarting everything
I would suggest the use of the excellent jwilder/nginx-proxy nginx docker image to act as a reverse proxy in front of your other containers. This reverse proxy will adapt to the running/stopped containers. You can add a container later on and this reverse proxy will automatically route traffic to it depending on the domain name.
I have no way to restart container automatically after a crash...
Take a look at the restart: directive for the docker-compose.yml file.
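A rough sketch of how the last two suggestions could fit together, written as a small Python script that prints a Compose document as JSON (which is also valid YAML). The service names, the wordpress backend and the hostname are placeholders. The socket mount and the VIRTUAL_HOST variable follow the jwilder/nginx-proxy README as I understand it, so check them against the version you use.

```python
import json


def with_restart_policy(services, policy="unless-stopped"):
    """Return a copy of the services mapping with a restart policy on each service."""
    return {name: {**spec, "restart": policy} for name, spec in services.items()}


services = {
    # Reverse proxy: watches the Docker socket and routes requests by VIRTUAL_HOST.
    "proxy": {
        "image": "jwilder/nginx-proxy",
        "ports": ["80:80"],
        "volumes": ["/var/run/docker.sock:/tmp/docker.sock:ro"],
    },
    # Hypothetical backend; the image and hostname are placeholders.
    "blog": {
        "image": "wordpress",
        "environment": {"VIRTUAL_HOST": "blog.example.com"},
    },
}

compose = {"version": "3", "services": with_restart_policy(services)}

# Redirect this output to a file and pass it to docker-compose with -f.
print(json.dumps(compose, indent=2))
```

With a restart policy set, the Docker engine restarts a crashed container on its own, and a backend added later only needs its own VIRTUAL_HOST for the proxy to pick it up.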
The "monolithic" view of docker-compose is indeed meant to let you manage your application stack as a whole. But docker-compose is just a layer on top of docker, and you can still use docker directly.
As @thomasleveil says, you can still manipulate containers created by docker-compose individually with docker.
$ docker exec project_web_1 ls -l /
$ docker stop project_db_1
$ docker start project_nginx_1
$ ...
On the other hand, I suggest relying more on docker-compose, which also lets you act on individual containers, separate your different applications or environments, and is aware of the dependencies between containers (this list is not exhaustive).
$ docker-compose exec web ls -l /
$ docker-compose stop db
$ docker-compose up -d nginx
$ ...
Booting up a new service is also very easy with docker-compose, since it can detect things based on your yml config without stopping anything if not needed.
$ docker-compose up -d
project_web_1 is up-to-date
project_db_1 is up-to-date
Creating project_newservice_1
I also find a reverse proxy very useful for production installations. However, I would rather suggest the brand new Traefik, which brings nice features like hot-reloading, service discovery, and automated SSL certificate issuance and renewal with Let's Encrypt (again, not an exhaustive list).

Dart lang app with open stack / docker / vagrant

I'm a newbie with these technologies (OpenStack / Docker / Vagrant) and I'm not sure I understood them correctly (most likely I did not). As I understand it, they give you something like a portable application that runs with the same development configuration, so that the whole development team has the same setup. What I did not understand is what happens after development, and how to benefit from them with a Dart app.
my question is:
1. Correct my understanding
2. Do I need the end user to have these things installed in his system, and run my application through them, same as in the development stage?
3. How can I build/develop/distribute a Dart app through them? Maybe because these, as well as Dart, are new, I could not find enough info while googling.
thanks
Docker is similar to a virtual machine like VMware or VirtualBox in that it creates an abstraction layer between the host operating system and the operating system running within a Docker container. The difference is that Docker doesn't emulate the entire hardware. The disadvantage is that Docker only runs on Linux and only Linux can be run inside Docker. If your host is an Intel system, you can't run an ARM Linux inside the container. (Theoretically, you can run VirtualBox inside Docker and run Windows or other OSes in it.)
With Docker you can test your application locally in the same environment as the application will run when deployed.
When, for example, you create an application that you want to run in Google Compute Engine, you install and test it locally inside a Docker container and then deploy the Docker container to Google Compute Engine as a whole unit. When there is a bug in the deployed application, you should be able to reproduce it locally as well, because it's just a 1:1 copy. No bug could have been introduced because the operating system or other dependencies were installed differently in the deployment environment than in the development/test environment.
The Dockerfile is a set of instructions to set up a Docker container. If you want to create a new Docker container (for example for a new developer), you just let Docker process the Dockerfile and a new Docker container is created from it. This makes it easy to create new containers.
If you want to update a dependency to a newer version, or add or remove components in the environment, you change the Dockerfile and create a new container from it. This way you avoid manual changes to an existing container, which would let the containers of different developers, testers and deployments diverge from each other.
I haven't used OpenStack myself but from the web page it seems to provide components and tools to build and manage your own cloud infrastructure.
I also haven't used Vagrant myself, but it seems to help automate a lot of tasks related to creating and managing virtual machines with VMware, VirtualBox, Docker and probably others.
When you have, for example, a server application, it probably consists of a number of components that you don't want to run all in one container but split up into several containers: one container for the database, one for the web server, one for the backend application (created in Dart, for example), and others. It can become cumbersome to manage all those containers. Vagrant helps to automate the related tasks.

How to scale Docker containers in production

So I recently discovered this awesome tool, and it says
Docker is an open-source project to easily create lightweight,
portable, self-sufficient containers from any application. The same
container that a developer builds and tests on a laptop can run at
scale, in production, on VMs, bare metal, OpenStack clusters, public
clouds and more.
Let's say I have a Docker image that runs Nginx and a website that connects to an external database. How do I scale the container in production?
Update: 2019-03-11
First of all, thanks to those who have upvoted this answer over the years.
Please be aware that this question was asked in August 2013, when Docker was still a very new technology. Since then: Kubernetes was launched in June 2014, Docker swarm was integrated into the Docker engine in Feb 2015, Amazon launched its container solution, ECS, in April 2015 and Google launched GKE in August 2015. It's fair to say the production container landscape has changed substantially.
The short answer is that you'd have to write your own logic to do this.
I would expect this kind of feature to emerge from the following projects, built on top of docker, and designed to support applications in production:
flynn
deis
coreos
Mesos
Update 1
Another related project I recently discovered:
maestro
Update 2
The latest release of OpenStack contains support for managing Docker containers:
Docker Openstack
Paas zone within OpenStack
Update 3
System for managing Docker instances
Shipyard
And a presentation on how to use tools like Packer, Docker and Serf to deliver an immutable server infrastructure pattern
FutureOps with Immutable Infrastructure
Slides
Update 4
A neat article on how to wire together docker containers using serf:
Decentralizing Docker: How to use serf with Docker
Update 5
Run Docker on Mesos using the Marathon framework
Mesosphere Docker Developer Tutorial
Update 6
Run Docker on Tsuru as it supports docker-cluster and segregated scheduler deploy
http://blog.tsuru.io/2014/04/04/running-tsuru-in-production-scaling-and-segregating-docker-containers/
Update 7
Docker-based environments orchestration
maestro-ng
Update 8
decking.io
Update 9
Google kubernetes
Update 10
Redhat have refactored their openshift PAAS to integrate Docker
Project Atomic
Geard
Update 11
A Docker NodeJS lib wrapping the Docker command line and managing it from a json file.
docker-cmd
Update 12
Amazon's new container service enables scaling in the cluster.
Update 13
Strictly speaking, Flocker does not "scale" applications, but it is designed to fulfil a related function of making stateful containers (running database services?) portable across multiple Docker hosts:
https://clusterhq.com/
Update 14
A project to create portable templates that describe Docker applications:
http://panamax.io/
Update 15
The Docker project is now addressing orchestration natively (See announcement)
Docker machine
Docker swarm
Docker compose
Update 16
Spotify Helios
See also:
https://blog.docker.com/tag/helios/
Update 17
The Openstack project now has a new "container as a service" project called Magnum:
https://wiki.openstack.org/wiki/Magnum
Shows a lot of promise, enables the easy setup of Docker orchestration frameworks like Kubernetes and Docker swarm.
Update 18
Rancher is a project that is maturing rapidly
http://rancher.com/
Nice UI and strong focus on hybrid Docker infrastructures
Update 19
The Lattice project is an offshoot of Cloud Foundry for managing container clusters.
Update 20
Docker recently bought Tutum:
https://www.docker.com/tutum
Update 21
Package manager for applications deployed on Kubernetes.
http://helm.sh/
Update 22
Vamp is an open source and self-hosted platform for managing (micro)service oriented architectures that rely on container technology.
http://vamp.io/
Update 23
A Distributed, Highly Available, Datacenter-Aware Scheduler
https://www.nomadproject.io/
From the guys that gave us Vagrant and other powerful tools.
Update 24
Container hosting solution for AWS, open source and based on Kubernetes
https://supergiant.io/
Update 25
Apache Mesos-based container hosting located in Germany
https://sloppy.io/features/#features
And Docker Inc. also provide a container hosting service called Docker cloud
https://cloud.docker.com/
Update 26
Jelastic is a hosted PAAS service that scales containers automatically.
Deis automates scaling of Docker containers (among other things).
Deis (pronounced DAY-iss) is an open source PaaS that makes it easy to deploy and manage applications on your own servers. Deis builds upon Docker and CoreOS to provide a lightweight PaaS with a Heroku-inspired workflow.
Here is the developer workflow:
deis create myapp # create a new deis app called "myapp"
git push deis master # built with a buildpack or dockerfile
deis scale web=16 worker=4 # scale up docker containers
Deis automatically deploys your Docker containers across a CoreOS cluster and configures the Nginx routers to route requests to healthy Docker containers. If a host dies, containers are automatically restarted on another host in seconds. Just browse to the proxy URL or use deis open to hit your app.
Some other useful commands:
deis config:set DATABASE_URL= # attach to a database w/ an envvar
deis run make test # run ephemeral containers for one-off tasks
deis logs # get aggregated logs for troubleshooting
deis rollback v23 # rollback to a prior release
To see this in action, check out the terminal video at http://deis.io/overview/. You can also learn about Deis concepts or jump right into deploying your own private PaaS.
You can try Tsuru. Tsuru is an open source PaaS inspired by Heroku, and it already runs some products in production at Globo.com (the internet arm of the biggest broadcast television company in Brazil).
It manages the entire flow of an application, from container creation through deployment and routing (with hipache), with many nice features such as Docker clustering, scaling of units, segregated deploys, etc.
Take a look at our documentation below:
http://docs.tsuru.io/
Here is our post covering our environment:
http://blog.tsuru.io/2014/04/04/running-tsuru-in-production-scaling-and-segregating-docker-containers/
Have a look at Rancher.com - it can manage multiple Docker hosts and much more.
A sensible approach to scaling Docker could be:
Each service will be a docker container
Inter-container service discovery managed through links (new feature from docker 0.6.5)
Containers will be deployed through Dokku
Applications will be managed through Shipyard which in its turn is using hipache
Another docker open sourced project from Yandex:
cocaine
The OpenShift guys also created a project. You can find more information here, try a test container, and see detailed info here.
The only problem is the solution is Redhat centric for now :)
While we're big fans of Deis (deis.io) and are actively deploying to it, there are other Heroku like PaaS style deployment solutions out there, including:
Longshoreman from the Wayfinder folks:
https://github.com/longshoreman/longshoreman
Decker from the CloudCredo folks, using CloudFoundry:
http://www.cloudcredo.com/decker-docker-cloud-foundry/
As for straight up orchestration, NewRelic's opensource Centurion project seems quite promising:
https://github.com/newrelic/centurion
Take a look also at etcd and Consul.
Panamax: Docker Management for Humans. panamax.io
Fig: Fast, isolated development environments using Docker. fig.sh
One option not mentioned in other posts is Helios. It is built by spotify and does not try to do too much.
https://github.com/spotify/helios

Resources