Does restarting Airflow affect currently running jobs?

This seems like a mundane question, but just to be on the safe side: what are the effects of restarting the Airflow service on jobs that are currently running?

If you only restart the Airflow webserver/scheduler processes, the running jobs are not affected. However, restarting the worker process kills the job (it is killed as a zombie - http://airflow.incubator.apache.org/concepts.html#zombies-undeads), and it may or may not be retried according to the DAG/task rules.
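As a minimal sketch of those "DAG/task rules" (not taken from the answer above; the DAG id, task id, and retry settings are illustrative assumptions, written against Airflow 2.x), task-level retries determine whether a job killed during a worker restart is re-attempted:

```python
# Minimal sketch, assuming Airflow 2.x: a task configured so that, if its worker
# is restarted and the task is killed as a zombie, the scheduler retries it.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="restart_tolerant_example",   # hypothetical DAG id
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    long_running = BashOperator(
        task_id="long_running_step",     # hypothetical task id
        bash_command="sleep 600",        # stands in for a long-running job
        retries=2,                       # re-run the task if it is killed mid-flight
        retry_delay=timedelta(minutes=5),
    )
```

With retries left at 0 (the default unless overridden in default_args or the installation's configuration), a task killed by a worker restart is simply marked failed.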

Related

Airflow scheduler does not start after Google Composer update

I have a composer-2.0.25-airflow-2.2.5 environment. I need to update the number of workers and the environment variables in an environment that is already running. After updating the environment, the scheduler monitoring is unhealthy and the pod keeps restarting on its own. Sometimes a CrashLoopBackOff appears, which indicates that a container is repeatedly crashing after restarting.
I looked at the info of the pod, where I saw the scheduler restarts.
I need the environment to continue running after the updates.
Do you have any idea about this issue?

mwaa restart functionality for requirements.txt updates

Every time our team puts another requirements.txt file for our MWAA environment, it requires a restart.
Regardless of the environment being in a PENDING or UPDATING state, I can still access the UI and run/monitor DAGS. I expect something to at least be unavailable or locked during this process from a user perspective.
So, my questions are: in the MWAA way of things, what exactly is being "restarted" during this process and why is applied to the entire so-called MWAA environment?
The Airflow DAG processor, Airflow workers, and Airflow scheduler are rebooted, but not the Airflow web server.
This can be confirmed by checking their respective logs.
Beware: long-running tasks can fail during a reboot.

If the Airflow scheduler crashes, can Airflow restart the in-progress job in another scheduler container?

If the Airflow scheduler crashes, can Airflow restart the in-progress job in another scheduler container? Or does it have to rerun the job from the beginning?
I am considering using Airflow to implement on-demand nearline processing and would like to understand its reliability. I could not confirm this point from the docs.

Running airflow DAG/tasks on different hosts

We currently have a bunch of independent jobs running on different servers and scheduled with crontab. The goal is to have a single view of all the jobs across the servers and whether they've run successfully, etc.
Airflow is one of the tools we are considering to achieve this. But our servers are configured very differently. Is it possible to set up Airflow so that DAG1 (and the Airflow scheduler & webserver) runs on server1 and DAG2 runs on server2, without RabbitMQ?
Essentially I'd like to achieve something like the first answer given here (or just at a DAG level): Airflow DAG tasks parallelism on different worker nodes
in the quickest & simplest way possible!
Thanks
You can check out Running Apache-Airflow with Celery Executor in Docker.
To use Celery, you can instantiate a Redis node as a pod and proceed with managing tasks across multiple hosts.
The link above will also give you a starter docker-compose YAML to help you get started quickly with Apache Airflow on the Celery executor.
Is it possible to set up airflow so that DAG1 (and the airflow scheduler & webserver) runs on server1 and DAG2 runs on server2 without RabbitMQ?
By default, Airflow on the Celery Executor will spread work across multiple hosts, and the division is always at the task level, not the DAG level.
This post might help you with spawning specific tasks on a specific worker node.
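To illustrate the task-level routing described above, here is a minimal sketch (my own assumption, not taken from the linked post) that uses Celery queues to pin all of DAG1's tasks to server1: each task declares a queue, and the worker on server1 is started listening only on that queue.

```python
# Minimal sketch, assuming Airflow 2.x with the Celery Executor. The queue name
# and DAG/task ids are illustrative assumptions. Start the worker on server1 with:
#   airflow celery worker --queues server1_queue
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="dag1_on_server1",            # hypothetical DAG id
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    run_job = BashOperator(
        task_id="run_job",
        bash_command="echo 'this runs only on server1'",
        queue="server1_queue",           # only the worker consuming this queue picks it up
    )
```

A second DAG can do the same with a server2_queue consumed only by server2's worker, which effectively gives a per-DAG split even though the routing itself is per task.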

Airflow dag dependencies

I have an Airflow dag-1 that runs for approximately a week and a dag-2 that runs every day for a few hours. When dag-1 is running I cannot have dag-2 running due to an API rate limit (also, dag-2 is supposed to run once dag-1 is finished).
Suppose dag-1 is already running and dag-2, which is supposed to run every day, fails; is there a way I can schedule the DAG dependencies the right way?
Is it possible to stop dag-1 temporarily (while it is running) when dag-2 is supposed to start, and then run dag-1 again without manual intervention?
One of the best ways is to use a defined pool.
Let's say you have a pool named "specific_pool" and allocate only one slot to it.
Specify that pool name on your DAG's bash command (i.e. on the task; use the newly created pool instead of the default pool). That way you can prevent both DAGs from running in parallel.
Whenever dag-1 is running, dag-2 will never be triggered until the pool slot is free; likewise, if dag-2 has picked up the slot, dag-1 will not be triggered until dag-2 has completed.
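A minimal sketch of the pool approach (the pool name follows the answer; DAG/task ids and commands are illustrative assumptions). The pool would first be created with a single slot, and then every task in both DAGs that hits the rate-limited API declares that pool:

```python
# Minimal sketch, assuming Airflow 2.x: dag-2's API task shares a single-slot
# pool with dag-1's tasks, so only one of them can run at a time.
# Create the pool once with:
#   airflow pools set specific_pool 1 "serialize dag-1 and dag-2"
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="dag_2",                      # the daily DAG from the question
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    call_api = BashOperator(
        task_id="call_api",              # hypothetical task id
        bash_command="echo 'call the rate-limited API'",
        pool="specific_pool",            # same single-slot pool as dag-1's tasks
    )
```

Note that the pool only serializes tasks that declare it; it does not pause dag-1 mid-run, so the second part of the question (temporarily stopping dag-1) is not covered by pools alone.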
