Testing Airflow DAGs with KubernetesPodOperator

We have a sort of self-serve Airflow cluster that mandates that all tasks are wrapped as KubernetesPodOperator tasks. With this setup, what are the possible options for testing the DAGs in a CI/CD pipeline?
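Not an answer from the thread, but as a hedged sketch of one common CI/CD option: load the DAG folder with Airflow's DagBag in a unit test and fail the pipeline on import errors, and optionally assert the cluster's KubernetesPodOperator-only rule. The dags/ path and the use of pytest are assumptions, not part of the original setup.

```python
# Hedged sketch: DAG integrity tests to run in CI, assuming DAG files live under
# dags/ and the CI image has apache-airflow plus the cncf.kubernetes provider installed.
from airflow.models import DagBag


def test_dags_import_without_errors():
    # Any syntax or import error in a DAG file shows up in import_errors.
    dag_bag = DagBag(dag_folder="dags/", include_examples=False)
    assert not dag_bag.import_errors, f"DAG import failures: {dag_bag.import_errors}"


def test_every_task_is_a_kubernetes_pod_operator():
    # Enforce the cluster's "all tasks are KubernetesPodOperator" rule before deploy.
    dag_bag = DagBag(dag_folder="dags/", include_examples=False)
    for dag in dag_bag.dags.values():
        for task in dag.tasks:
            assert task.task_type == "KubernetesPodOperator", (
                f"{dag.dag_id}.{task.task_id} is a {task.task_type}"
            )
```

Actually running the pods end-to-end usually needs a disposable Kubernetes cluster (for example kind or minikube) in the pipeline, which these tests deliberately avoid.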

Related

If the Airflow scheduler crashes, can Airflow restart the in-progress job in another scheduler container?

If the Airflow scheduler crashes, can Airflow restart the in-progress job in another scheduler container? Or does it have to rerun the job from the beginning?
I am considering using Airflow to implement on-demand nearline processing and would like to understand its reliability characteristics, but I could not confirm this point from the docs.

Not receiving email alerts from Airflow

In our Airflow deployment there are many components, some based on Python and some on PySpark. Since Airflow does not support PySpark natively, we hand those components off from Airflow to EMR. We receive mails from EMR, but for the Python components at the Airflow level we are not getting any email alerts. Is there any method to add this configuration at runtime in Airflow?
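Not from the thread, but as a hedged sketch: Airflow can send its own failure emails for Python tasks when the [smtp] section of airflow.cfg points at a working mail server and the DAG enables it via default_args. The recipient address, dag_id, and schedule below are placeholders, and the example assumes Airflow 2.4+.

```python
# Hedged sketch: turning on Airflow's built-in failure emails for Python tasks.
# Assumes [smtp] is configured in airflow.cfg; address and dag_id are placeholders.
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

default_args = {
    "email": ["alerts@example.com"],  # placeholder recipient
    "email_on_failure": True,         # mail is sent when a task fails
    "email_on_retry": False,
}

with DAG(
    dag_id="python_components",       # placeholder name
    start_date=datetime(2024, 1, 1),
    schedule=None,                    # Airflow 2.4+; use schedule_interval on older versions
    default_args=default_args,
) as dag:
    PythonOperator(task_id="python_step", python_callable=lambda: print("ok"))
```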

Running airflow DAG/tasks on different hosts

We currently have a bunch of independent jobs running on different servers and scheduled with crontab. The goal is to have a single view of all the jobs across the servers, whether they've run successfully, etc.
Airflow is one of the tools we are considering to achieve this, but our servers are configured very differently. Is it possible to set up Airflow so that DAG1 (and the Airflow scheduler & webserver) runs on server1 and DAG2 runs on server2, without RabbitMQ?
Essentially I'd like to achieve something like the first answer given here (or just at a DAG level): Airflow DAG tasks parallelism on different worker nodes
in the quickest & simplest way possible!
Thanks
You can check out Running Apache Airflow with the Celery Executor in Docker.
To use Celery, you can run a Redis node as a pod and then manage tasks across multiple hosts.
The link above will also give you a starter docker-compose YAML to help you get started quickly with Apache Airflow on the Celery executor.
Is it possible to set up airflow so that DAG1 (and the airflow scheduler & webserver) runs on server1 and DAG2 runs on server2 without RabbitMQ.
By default, Airflow on the Celery executor will try to use multiple hosts, and the division of work is always at the task level, not the DAG level.
This post might help you with spawning specific tasks on a specific worker node.
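As a hedged illustration of that approach (the queue and host names below are assumptions, not from the linked post): start each Celery worker listening on its own queue, for example airflow celery worker --queues server1 on server1, and route tasks to it with the operator's queue argument.

```python
# Hedged sketch: pinning individual tasks to specific Celery workers via named queues.
# Assumes workers on server1/server2 were started with --queues server1 / --queues server2;
# queue names and commands are placeholders.
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(dag_id="per_host_tasks", start_date=datetime(2024, 1, 1), schedule=None) as dag:
    on_server1 = BashOperator(
        task_id="runs_on_server1",
        bash_command="hostname",
        queue="server1",  # only workers subscribed to this queue pick the task up
    )
    on_server2 = BashOperator(
        task_id="runs_on_server2",
        bash_command="hostname",
        queue="server2",
    )
    on_server1 >> on_server2
```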

Managed AWS Airflow with ECS Fargate DAG Jobs

Reading this article: https://aws.amazon.com/blogs/containers/running-airflow-on-aws-fargate/
It isn't clear to me: is it possible to run MWAA and execute jobs on ECS Fargate? Or, to execute jobs on ECS Fargate, do you need to run the entire Airflow stack on ECS Fargate?
I'd recommend reading this document on AWS MWAAs, specifically the section on Architecture as it should provide you with more context.
It isn't clear to me; is it possible to run MWAA and execute jobs on ECS Fargate?
Yes. An MWAA environment runs its Airflow components (scheduler, workers, etc.) on Fargate and will automatically execute its jobs in Fargate containers. It will also scale the number of containers to meet demand.
There is also a plethora of Airflow integrations out there that you can use to offload the tasks/nodes within a DAG to other services (such as ECS, Batch, etc.).
It is not well documented, but it is possible. We are successfully running MWAA with the ECS task operator and custom images.
Basically you'll need the following:
an MWAA environment
an MWAA execution role with added permissions to run tasks in ECS and access CloudWatch logs
ECS task definitions
You'll also need to add apache-airflow[amazon] to the MWAA requirements file.
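For illustration only, a hedged sketch of a DAG task that launches a container on ECS Fargate from MWAA; the cluster, task definition, container, subnet, and log group names are placeholders, and the import path assumes a recent amazon provider (older releases expose the operator as EcsOperator).

```python
# Hedged sketch: an MWAA DAG launching a job on ECS Fargate.
# Requires apache-airflow[amazon] in the MWAA requirements file; all resource
# names below are placeholders.
from datetime import datetime
from airflow import DAG
from airflow.providers.amazon.aws.operators.ecs import EcsRunTaskOperator

with DAG(dag_id="fargate_job", start_date=datetime(2024, 1, 1), schedule=None) as dag:
    run_job = EcsRunTaskOperator(
        task_id="run_fargate_task",
        cluster="my-ecs-cluster",           # placeholder cluster name
        task_definition="my-task-def",      # placeholder task definition
        launch_type="FARGATE",
        overrides={
            "containerOverrides": [
                {"name": "my-container", "command": ["python", "job.py"]}
            ]
        },
        network_configuration={
            "awsvpcConfiguration": {
                "subnets": ["subnet-0123456789abcdef0"],  # placeholder subnet
                "assignPublicIp": "ENABLED",
            }
        },
        awslogs_group="/ecs/my-task-def",             # stream container logs into the task log
        awslogs_stream_prefix="ecs/my-container",
    )
```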

Is there a way to view a list of all airflow workers?

New to Airflow, so apologies if this question doesn't really make sense. Is there a command or place in the webserver UI where one can see a list of all running workers? Also, if a Celery worker node is not explicitly started with airflow worker, are there "default" workers that are initialized with either the webserver or scheduler?
Regarding this part of your question:
Also, if a Celery worker node is not explicitly started with airflow worker, are there "default" workers that are initialized with either the webserver or scheduler?
Take a look at the docs on executors.
Using celery requires some configuration changes. If you haven't configured airflow to use celery, then even if you start a celery worker, the worker won't pick up any tasks.
Conversely, if you have configured airflow to use celery, and you have not started any celery workers, then your cluster will not execute a single task.
If you are using the SequentialExecutor (the default) or the LocalExecutor (which requires configuration), tasks are executed by the scheduler process, and no Celery workers are used (if you spun some up, they wouldn't execute any tasks).
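As a hedged aside (not part of the original answer), you can confirm which executor a deployment is actually configured with by reading Airflow's own configuration:

```python
# Hedged sketch: print the configured executor for this Airflow installation.
# The value comes from [core] executor in airflow.cfg or AIRFLOW__CORE__EXECUTOR.
from airflow.configuration import conf

print(conf.get("core", "executor"))  # e.g. SequentialExecutor, LocalExecutor, CeleryExecutor
```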
Regarding this:
Is there a way to view a list of all airflow workers?
If you have configured Airflow to use Celery, then you can run Flower to monitor the Celery workers. In Airflow >= 2.0.0, Flower is launched with airflow celery flower.
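As another hedged sketch (not from the answer): the worker list that Flower displays can also be pulled programmatically through Celery's inspect API. The import path below assumes Airflow 2.0-2.6; in newer releases the Celery app lives in the celery provider package (airflow.providers.celery.executors.celery_executor_utils) instead.

```python
# Hedged sketch: list live Celery workers without Flower, using Celery's inspect API.
# Import path is an assumption for Airflow 2.0-2.6; adjust for newer provider layouts.
from airflow.executors.celery_executor import app

replies = app.control.inspect().ping() or {}  # None when no workers respond
for worker_name in replies:
    print(worker_name)  # e.g. celery@worker-host-1
```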
