Every time our team uploads a new requirements.txt file for our MWAA environment, it requires a restart.
Regardless of whether the environment is in a PENDING or UPDATING state, I can still access the UI and run/monitor DAGs. I would expect something to at least be unavailable or locked during this process from a user perspective.
So, my questions are: in the MWAA way of things, what exactly is being "restarted" during this process, and why is it applied to the entire so-called MWAA environment?
The Airflow DAG processor, Airflow workers and Airflow scheduler are rebooted,
but not the Airflow web server.
This can be confirmed by checking their respective logs.
Beware: some long-running tasks can fail during a reboot.
I have a composer-2.0.25-airflow-2.2.5 environment. I need to update the number of workers and the environment variables in an environment that is already running. After updating the environment, the scheduler monitoring shows as unhealthy and the pod keeps restarting on its own. Sometimes CrashLoopBackOff appears, which indicates that a container is repeatedly crashing after restarting.
I looked at the pod details, which is where I saw the scheduler restarts.
I need the environment to continue running after the updates.
Do you have any idea about this issue?
We currently have a bunch of independent jobs running on different servers & being scheduled with crontab. The goal would be to have a single view of all the jobs across the servers and whether they've run successfully etc.
Airflow is one of the tools we are considering using to achieve this. But our servers are configured very differently. Is it possible to set up Airflow so that DAG1 (and the Airflow scheduler & webserver) runs on server1 and DAG2 runs on server2, without RabbitMQ?
Essentially I'd like to achieve something like the first answer given here (or just at a DAG level): Airflow DAG tasks parallelism on different worker nodes
in the quickest & simplest way possible!
You can check out Running Apache-Airflow with Celery Executor in Docker.
To use Celery, you can instantiate a Redis node as a pod and proceed with managing tasks across multiple hosts.
The link above will also give you a starter docker-compose YAML to help you get started quickly with Apache Airflow on the Celery Executor.
Is it possible to set up airflow so that DAG1 (and the airflow
scheduler & webserver) runs on server1 and DAG2 runs on server2
without RabbitMQ.
With the Celery Executor, Airflow by default spreads work across multiple hosts, and the division is always at the task level, not the DAG level.
This post might help you with spawning specific tasks on a specific worker node.
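As a rough illustration of the queue-based routing mentioned above, here is a minimal sketch that pins every task of a DAG to one Celery queue through default_args. The DAG ids, the queue names and the helper default_args_for are all hypothetical; the only Airflow feature relied on is the operator-level queue argument. Note that Celery still needs some broker (Redis, for instance), even if it is not RabbitMQ.

```python
# Hypothetical mapping from DAG id to the Celery queue that one host listens on.
DAG_QUEUES = {
    "dag1": "server1",
    "dag2": "server2",
}


def default_args_for(dag_id):
    """Return default_args that send every task of the DAG to one queue."""
    # "default" is Airflow's stock queue name; unmapped DAGs stay on it.
    return {"queue": DAG_QUEUES.get(dag_id, "default")}


# In a DAG file (assumes Airflow is installed on the host):
#   dag = DAG("dag2", default_args=default_args_for("dag2"), ...)
# and on server2 only a worker subscribed to that queue would be started:
#   airflow celery worker --queues server2   (Airflow 2.x)
#   airflow worker -q server2                (Airflow 1.x)
```

The division is still per task under the hood; putting the queue in default_args just makes all tasks of one DAG land on the same worker.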
I have an Airflow environment running on Cloud Composer (3 n1-standard-1 nodes; image version: composer-1.4.0-airflow-1.10.0; config override: core catchup_by_default=False; PyPI packages: kubernetes==8.0.1).
During a DAG run, a few tasks (all GKEPodOperators) failed due to airflow worker pod eviction. All of these tasks were set to retries=0. One of them was requeued and retried. Why would this happen when the task is set to 0 retries? And why would it only happen to one of the tasks?
"airflow worker pod eviction" means that the node ran short of resources, so some pods were evicted to free them up.
To fix this you can use larger machine types or try to reduce the DAGs' memory consumption.
Review this document for more detail.
We have an Airflow deployment with Celery executors.
Many of our DAGs require a local processing step of some file in a BashOperator or PythonOperator.
However, in our understanding the tasks of a given DAG may not always be scheduled on the same machine.
The options for state sharing between tasks I've gathered so far:
Use Local Executors - this may suffice for one team, depending on the load, but may not scale to the wider company
Use XCom - does this have a size limit? Probably unsuitable for large files
Write custom Operators for every combination of tasks that need local processing in between. This approach reduces modularity of tasks and requires replicating existing operators' code.
Use Celery queues to route DAGs to the same worker (docs) - This option seems attractive at first, but what would be an appropriate way to set it up in order to avoid routing everything to one executor, or crafting a million queues?
Use a shared network storage in all machines that run executors - Seems like an additional infrastructure burden, but is a possibility.
What is the recommended way to do sharing of large intermediate state, such as files, between tasks in Airflow?
To clarify something: no matter how you set up Airflow, there will only be one executor running.
The executor runs on the same machine as the scheduler.
Currently (Airflow 1.9.0 at the time of writing) there is no safe way to run multiple schedulers, so there will only ever be one executor running.
The Local Executor executes the task on the same machine as the scheduler.
The Celery Executor just puts tasks in a queue to be worked on by the Celery workers.
However, the question you are asking does apply to Celery workers. If you use the Celery Executor you will probably have multiple Celery workers.
Using network shared storage solves multiple problems:
Each worker machine sees the same dags because they have the same dags folder
Results of operators can be stored on a shared file system
The scheduler and webserver can also share the dags folder and run on different machines
I would use network storage and write the output file name to XCom. Then, when a task needs the output of a previous task as its input, it reads the file name from that task's XCom and processes that file.
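A minimal sketch of that pattern, written as two plain callables that could be handed to PythonOperator. The SHARED_DIR path, the task id "produce", the file contents and the line-counting step are assumptions made for illustration; on Airflow 1.x the operators would also need provide_context=True for ti to be passed in.

```python
import os
import tempfile
import uuid

# Assumption: every worker mounts the same network share at this path.
SHARED_DIR = os.environ.get("SHARED_DIR", tempfile.gettempdir())


def produce(**_context):
    """Write the large result to shared storage and return only its path."""
    path = os.path.join(SHARED_DIR, f"intermediate_{uuid.uuid4().hex}.txt")
    with open(path, "w") as fh:
        fh.write("row1\nrow2\nrow3\n")  # stand-in for the real large output
    return path  # a PythonOperator pushes the return value to XCom


def consume(ti, **_context):
    """Pull the path from the upstream task's XCom and process the file."""
    path = ti.xcom_pull(task_ids="produce")  # assumes the upstream task_id is "produce"
    with open(path) as fh:
        return sum(1 for _ in fh)  # stand-in processing: count the lines
```

Only the short path string travels through the metadata database, so the size of the file itself never touches XCom.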
Change the datatype of the value column in the xcom table of the Airflow metastore.
The default datatype of that column is BLOB.
Change it to LONGBLOB. That lets you store up to 4 GB between intermediate tasks.
This seems like a mundane question but just to be on the safe side,
what are the effects of restarting the airflow service on the jobs which are currently running?
If you only restart the Airflow webserver/scheduler processes, then the running jobs are not affected. However, restarting the worker process kills the job (it is killed as a zombie - http://airflow.incubator.apache.org/concepts.html#zombies-undeads), and then it may or may not be retried, according to the DAG/task rules.
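To make "the DAG/task rules" concrete, here is a small sketch of a retry policy set through default_args. The numbers are hypothetical and would need tuning to how long a worker restart takes in your setup.

```python
from datetime import timedelta

# Hypothetical values; tune them to how long a worker restart takes.
default_args = {
    "retries": 2,                         # re-run a task up to twice after a failure
    "retry_delay": timedelta(minutes=5),  # wait before each retry
}

# In a DAG file this dict would be passed as DAG(..., default_args=default_args),
# so every task inherits the retry policy unless it overrides it.
```

With retries left at 0, a task killed by a worker restart would normally just end up failed.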