Our Airflow pipelines contain many components, some based on Python and some on PySpark. Since Airflow does not run PySpark natively, we submit those components from Airflow to EMR. We receive failure mails from EMR, but for the Python components we are not getting any email alert at the Airflow level. Is there any method to add this config at runtime in Airflow?
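The usual way to get failure mail from Airflow itself (rather than from EMR) is per-DAG default_args, assuming the [smtp] section of airflow.cfg is configured. A minimal sketch, with a hypothetical address:

    from datetime import datetime, timedelta

    from airflow import DAG
    from airflow.operators.python import PythonOperator

    # Assumed: [smtp] is configured in airflow.cfg; the address is a placeholder.
    default_args = {
        "email": ["data-team@example.com"],  # hypothetical address
        "email_on_failure": True,            # mail when a task fails
        "email_on_retry": False,
        "retries": 1,
        "retry_delay": timedelta(minutes=5),
    }

    with DAG(
        dag_id="python_components",
        start_date=datetime(2021, 1, 1),
        schedule_interval="@daily",
        default_args=default_args,  # applied to every task in the DAG
    ) as dag:
        PythonOperator(task_id="transform", python_callable=lambda: None)

Because default_args applies to every task, the Python tasks then alert from Airflow directly, with no EMR involvement.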
I am a newbie to Airflow, so pardon any naive assumptions I make about it. I have an ETL set up at work where I run Airflow on a company cluster, with a DAG containing a few tasks. It is a possible scenario that the cluster on which Airflow runs crashes, in which case the DAG will not run.
I wanted to check whether we can set up a notification on failure of the Airflow scheduler itself. My online reading has turned up several useful articles on monitoring the DAG, but if the scheduler fails then those failure notifications won't be triggered (correct me if that's not how it works).
Open the link below in incognito mode if you hit a firewall and don't have a subscription:
https://medium.com/datareply/integrating-slack-alerts-in-airflow-c9dcd155105
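Note that callback-based alerts like these are sent by Airflow itself, so they won't fire if the scheduler is dead. For the scheduler-crash case, one option is to poll the webserver's /health endpoint (which reports scheduler heartbeat status) from outside the cluster. A sketch, with a hypothetical URL and a placeholder notifier:

    # External watchdog sketch: run this OUTSIDE the Airflow cluster, since a
    # check running on the same cluster dies with it. URL and alert hook are
    # assumptions; wire send_alert to email, Slack, PagerDuty, etc.
    import requests

    AIRFLOW_HEALTH_URL = "http://airflow.example.com:8080/health"  # hypothetical host

    def send_alert(message: str) -> None:
        print(message)  # placeholder notifier

    def check_scheduler() -> None:
        try:
            status = requests.get(AIRFLOW_HEALTH_URL, timeout=10).json()
            healthy = status["scheduler"]["status"] == "healthy"
        except Exception:
            healthy = False  # an unreachable webserver counts as unhealthy too
        if not healthy:
            send_alert("Airflow scheduler is unhealthy or unreachable")

    if __name__ == "__main__":
        check_scheduler()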
You have to use external software for this, such as Datadog.
Here you can find more information:
https://docs.datadoghq.com/integrations/airflow/?tab=host
Basically, you connect Datadog to Airflow externally through StatsD.
In my case, I have Airflow deployed through docker-compose, and Datadog runs as another container (from the official Datadog Docker image), linked to the scheduler and webserver containers.
You can also use Grafana and Prometheus (also through StatsD), which is the open-source route: https://databand.ai/blog/everyday-data-engineering-monitoring-airflow-with-prometheus-statsd-and-grafana/
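In either setup, the Airflow side of the wiring is just the StatsD settings. A sketch using the AIRFLOW__<SECTION>__<KEY> environment-variable form of the config (convenient in docker-compose); the agent host name is an assumption:

    # Sketch: enable Airflow's StatsD emitter via environment variables
    # (AIRFLOW__<SECTION>__<KEY>), e.g. exported in a docker-compose file.
    # The section is "metrics" in Airflow 2.x (it was "scheduler" in 1.10).
    import os

    os.environ["AIRFLOW__METRICS__STATSD_ON"] = "True"
    os.environ["AIRFLOW__METRICS__STATSD_HOST"] = "datadog-agent"  # assumed container name
    os.environ["AIRFLOW__METRICS__STATSD_PORT"] = "8125"           # default StatsD port
    os.environ["AIRFLOW__METRICS__STATSD_PREFIX"] = "airflow"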
Reading this article: https://aws.amazon.com/blogs/containers/running-airflow-on-aws-fargate/
It isn't clear to me: is it possible to run MWAA and execute jobs on ECS Fargate? Or, to execute jobs on ECS Fargate, do you need to run the entire Airflow stack on ECS Fargate?
I'd recommend reading this document on Amazon MWAA, specifically the section on architecture, as it should provide you with more context.
It isn't clear to me: is it possible to run MWAA and execute jobs on ECS Fargate?
Yes. MWAA runs its Airflow components (scheduler, workers, etc.) on Fargate and will automatically execute its jobs in Fargate containers. It will also scale the number of containers to meet demand.
There is also a plethora of Airflow integrations out there that you can use to offload the tasks/nodes within a DAG to other services (such as ECS, Batch, etc.).
It is not well documented, but it is possible. We are successfully running MWAA with the ECS task operator and custom images.
Basically, you'll need the following:
MWAA environment
MWAA execution role with added permissions to run tasks in ECS and access CloudWatch logs
ECS task definitions
You'll also need to add apache-airflow[amazon] to the MWAA requirements file.
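With that in place, a DAG task that runs a custom image on Fargate looks roughly like the sketch below. The class is EcsRunTaskOperator in recent amazon provider releases (older releases call it EcsOperator); the cluster, task definition, subnet, and log group names are all placeholders:

    from datetime import datetime

    from airflow import DAG
    from airflow.providers.amazon.aws.operators.ecs import EcsRunTaskOperator

    with DAG(
        dag_id="mwaa_fargate_example",
        start_date=datetime(2021, 1, 1),
        schedule_interval=None,
    ) as dag:
        run_job = EcsRunTaskOperator(
            task_id="run_custom_image",
            cluster="my-ecs-cluster",       # hypothetical cluster name
            task_definition="my-task:1",    # hypothetical task definition
            launch_type="FARGATE",
            overrides={
                "containerOverrides": [
                    {"name": "my-container", "command": ["python", "job.py"]},
                ],
            },
            network_configuration={
                "awsvpcConfiguration": {
                    "subnets": ["subnet-0123456789abcdef0"],  # placeholder
                    "assignPublicIp": "DISABLED",
                },
            },
            awslogs_group="/ecs/my-task",   # lets the task logs surface in Airflow
            awslogs_stream_prefix="ecs/my-container",
        )

The execution-role permissions from the list above (ecs:RunTask plus CloudWatch logs access) are what allow this operator to start the task and stream its logs.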
We have a sort of self-serve Airflow cluster that mandates that all tasks are wrapped as KubernetesPodOperator tasks. With this setup, what are the possible options for testing the DAGs in a CI/CD pipeline?
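One common baseline is a DagBag import test in CI, which catches syntax and import errors and can assert the wrapping policy without needing a Kubernetes cluster; actually executing the pods is then a separate integration stage against a throwaway cluster (kind, minikube, etc.). A sketch with assumed paths (note the KubernetesPodOperator import path varies by provider version):

    # CI sketch: validate that DAGs load cleanly and obey the "everything is a
    # KubernetesPodOperator" policy, without running any pods. Paths are assumed.
    from airflow.models import DagBag
    from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import (
        KubernetesPodOperator,
    )

    def test_dags_import_cleanly():
        dag_bag = DagBag(dag_folder="dags/", include_examples=False)
        assert dag_bag.import_errors == {}, f"Import failures: {dag_bag.import_errors}"

    def test_all_tasks_are_pod_operators():
        dag_bag = DagBag(dag_folder="dags/", include_examples=False)
        for dag in dag_bag.dags.values():
            for task in dag.tasks:
                assert isinstance(task, KubernetesPodOperator), (
                    f"{dag.dag_id}.{task.task_id} is not a KubernetesPodOperator"
                )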
We are setting up Airflow for scheduling/orchestration. Currently we have Spark Python loads and non-Spark loads on different servers, and files are pushed to GCP from yet another server. Is there an option to decide which worker nodes an Airflow task is submitted to? Currently we are using SSH connections to run all workloads. Our processing is mostly on-prem.
We use the Celery executor model. How do we make sure that a specific task is run on its appropriate node?
Task 1 runs on a non-Spark server (no Spark binaries available).
Task 2 executes a PySpark submit (this server has the Spark binaries).
Task 3 pushes the files created by task 2 from another server/node (only this one has the GCP utilities installed to push the files, for security reasons).
If I create a DAG, is it possible to pin a task to a set of worker nodes?
Currently we have a wrapper shell script for each task and make three SSH runs to complete the process. We would like to avoid such wrapper shell scripts and instead use the built-in PythonOperator, SparkSubmitOperator, SparkJdbcOperator, and SFTPToGCSOperator, while making sure each task runs on a specific server or worker node.
In short, can we have three worker-node groups and have each task execute on a group of nodes based on the operation?
You can assign a queue to each worker node.
Start each Airflow worker with its queue specified (in Airflow >= 2.0 the command is airflow celery worker -q ...):
airflow worker -q sparkload
airflow worker -q non-sparkload
airflow worker -q gcpload
Then set the queue on each task (see the sketch below). A similar thread:
How can Airflow be used to run distinct tasks of one workflow in separate machines?
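A minimal sketch of the DAG side, assuming workers were started with the queues above. queue is a standard BaseOperator argument honored by the Celery executor; the operator arguments here (paths, connection ids, bucket) are placeholders:

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator
    from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator
    from airflow.providers.google.cloud.transfers.sftp_to_gcs import SFTPToGCSOperator

    with DAG(dag_id="routed_loads", start_date=datetime(2021, 1, 1),
             schedule_interval="@daily") as dag:
        prepare = PythonOperator(
            task_id="prepare",
            python_callable=lambda: None,
            queue="non-sparkload",       # only workers started with -q non-sparkload
        )
        spark_job = SparkSubmitOperator(
            task_id="spark_job",
            application="/jobs/etl.py",  # placeholder path
            queue="sparkload",           # the worker with the Spark binaries
        )
        push_to_gcs = SFTPToGCSOperator(
            task_id="push_to_gcs",
            sftp_conn_id="sftp_default",          # placeholder connection ids
            gcp_conn_id="google_cloud_default",
            source_path="/data/out/*.csv",
            destination_bucket="my-bucket",       # placeholder bucket
            queue="gcpload",             # the worker with the GCP utilities
        )
        prepare >> spark_job >> push_to_gcs

This replaces the three SSH wrappers: each task lands on the worker group that has the binaries it needs.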
New to Airflow, so apologies if this question doesn't really make sense. Is there a command or place in the webserver UI where one can see a list of all running workers? Also, if a Celery worker node is not explicitly started with airflow worker, are there "default" workers that are initialized with either the webserver or scheduler?
Regarding this part of your question:
Also, if a Celery worker node is not explicitly started with airflow worker, are there "default" workers that are initialized with either the webserver or scheduler?
Take a look at the docs on executors.
Using Celery requires some configuration changes. If you haven't configured Airflow to use Celery, then even if you start a Celery worker, the worker won't pick up any tasks.
Conversely, if you have configured Airflow to use Celery but have not started any Celery workers, then your cluster will not execute a single task.
If you are using the SequentialExecutor (the default) or the LocalExecutor (which requires configuration), tasks are executed by the scheduler and no Celery workers are used (if you spun some up, they wouldn't execute any tasks).
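For reference, the switch being discussed is a single config value; a sketch using the environment-variable form (equivalent to executor = CeleryExecutor under [core] in airflow.cfg), with assumed broker and result-backend URLs:

    # Sketch: configuring Airflow to use Celery. Celery also needs a message
    # broker and a result backend; the URLs below are assumptions.
    import os

    os.environ["AIRFLOW__CORE__EXECUTOR"] = "CeleryExecutor"
    os.environ["AIRFLOW__CELERY__BROKER_URL"] = "redis://redis:6379/0"
    os.environ["AIRFLOW__CELERY__RESULT_BACKEND"] = "db+postgresql://airflow:pw@postgres/airflow"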
Regarding this:
Is there a way to view a list of all airflow workers?
If you have configured Airflow to use Celery, then you can run Flower to monitor the Celery workers. In Airflow >= 2.0.0, Flower is launched with airflow celery flower.