I am new to Airflow, so pardon any naive assumptions I make about it. I have an ETL set up at work where I run Airflow on a company cluster, with a DAG containing a few tasks. It is possible that the cluster on which Airflow runs crashes, and in that event the DAG will not run.
I wanted to check whether we can set up a notification on failure of the Airflow scheduler itself. My online reading has turned up several useful articles on monitoring the DAG, but if the scheduler fails, those failure notifications won't be triggered (correct me if that's not how it works).
Open the link below in incognito if it's blocked for you and you don't have a subscription:
https://medium.com/datareply/integrating-slack-alerts-in-airflow-c9dcd155105
You have to use external software for this, such as Datadog.
Here you can find more information:
https://docs.datadoghq.com/integrations/airflow/?tab=host
Basically, you connect Datadog to Airflow externally through StatsD.
In my case, Airflow is deployed through docker-compose, and Datadog runs as another container (from the official Datadog Docker image), linked to the scheduler and webserver containers.
You can also use Grafana and Prometheus (also through StatsD), which is the open-source way: https://databand.ai/blog/everyday-data-engineering-monitoring-airflow-with-prometheus-statsd-and-grafana/
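If you want something lighter than a full monitoring stack, the key point is that the check must run outside the Airflow cluster, otherwise it dies along with the scheduler. Below is a minimal sketch that polls the Airflow 2.x webserver's /health endpoint, which reports scheduler status based on its recent heartbeat; the URL and Slack webhook are placeholders:

    # External scheduler watchdog (sketch). Run this on a machine that is
    # NOT part of the Airflow cluster, so it survives a cluster crash.
    import time
    import requests

    AIRFLOW_HEALTH_URL = "http://airflow.example.com:8080/health"     # placeholder
    SLACK_WEBHOOK = "https://hooks.slack.com/services/T000/B000/XXX"  # placeholder

    def scheduler_is_healthy():
        try:
            health = requests.get(AIRFLOW_HEALTH_URL, timeout=10).json()
            # /health reports the scheduler as unhealthy when its heartbeat is stale
            return health["scheduler"]["status"] == "healthy"
        except Exception:
            # Webserver unreachable: treat the whole deployment as down
            return False

    while True:
        if not scheduler_is_healthy():
            requests.post(SLACK_WEBHOOK, json={"text": "Airflow scheduler appears to be down!"})
        time.sleep(60)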
Every time our team uploads a new requirements.txt file for our MWAA environment, it requires a restart.
Regardless of the environment being in a PENDING or UPDATING state, I can still access the UI and run/monitor DAGs. I would expect something to be unavailable or locked during this process from a user perspective.
So, my questions are: in the MWAA way of things, what exactly is being "restarted" during this process, and why is it applied to the entire so-called MWAA environment?
The Airflow DAG processor, Airflow workers, and Airflow scheduler are rebooted, but not the Airflow web server.
This can be confirmed by checking their respective logs.
Beware: long-running tasks can fail during a reboot.
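For reference, this is roughly how such a requirements update is applied via boto3 (a sketch; the bucket, key, and environment names are made up, and MWAA requires versioning to be enabled on its S3 bucket):

    # Sketch: upload a new requirements.txt and kick off the MWAA update
    # that reboots the scheduler, workers, and DAG processor.
    import boto3

    s3 = boto3.client("s3")
    mwaa = boto3.client("mwaa")

    s3.upload_file("requirements.txt", "my-mwaa-bucket", "requirements.txt")
    version = s3.head_object(Bucket="my-mwaa-bucket", Key="requirements.txt")["VersionId"]

    # The environment goes into UPDATING while the components above are recreated.
    mwaa.update_environment(
        Name="my-mwaa-environment",
        RequirementsS3Path="requirements.txt",
        RequirementsS3ObjectVersion=version,
    )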
Context
We are running Airflow 2.1.4 on an AKS cluster. The Airflow metadata database is an Azure managed PostgreSQL instance (8 CPUs). We have a DAG with about 30 tasks; each task uses a KubernetesPodOperator (from apache-airflow-providers-cncf-kubernetes==2.2.0) to execute some container logic. Airflow is configured with the official Airflow Helm chart. The executor is Celery.
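For illustration, the DAG looks roughly like this (a sketch; the DAG id, task names, and image are made up):

    # Rough shape of the DAG described above: ~30 KubernetesPodOperator tasks
    # (import path as of apache-airflow-providers-cncf-kubernetes==2.2.0).
    from datetime import datetime
    from airflow import DAG
    from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import KubernetesPodOperator

    with DAG("container_logic", start_date=datetime(2021, 1, 1),
             schedule_interval=None, catchup=False) as dag:
        for i in range(30):
            KubernetesPodOperator(
                task_id=f"task_{i}",
                name=f"task-{i}",
                namespace="airflow",
                image="myregistry.azurecr.io/container-logic:latest",  # placeholder
                get_logs=True,
            )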
Issue
Usually the first five or so tasks execute successfully (taking one or two minutes each) and get marked as done (colored green) in the Airflow UI. The tasks after that are also executed successfully on AKS, but are not marked as completed in Airflow. In the end this leads to the error message below, and the already-finished task is marked as failed:
[2021-12-15 11:17:34,138] {pod_launcher.py:333} INFO - Event: task.093329323 had an event of type Succeeded
...
[2021-12-15 11:19:53,866] {base_job.py:230} ERROR - LocalTaskJob heartbeat got an exception
psycopg2.OperationalError: could not connect to server: Connection timed out
Is the server running on host "psql-airflow-dev-01.postgres.database.azure.com" (13.49.105.208) and accepting
TCP/IP connections on port 5432?
Similar posting
This issue is also described in this post: https://www.titanwolf.org/Network/q/98b355ff-d518-4de3-bae9-0d1a0a32671e/y although the Stack Overflow link in that post no longer works.
The metadata database (Azure managed PostgreSQL) is not overloaded, and the AKS node pool we are using does not show any sign of stress either. It seems the scheduler cannot pick up / detect a finished task after a couple of tasks have run.
We also looked at several configuration options, as stated here.
We have been looking into this for a number of days now, unfortunately without success.
Does anyone have any ideas what the cause could be? Any help is appreciated!
We currently have a bunch of independent jobs running on different servers, scheduled with crontab. The goal is to have a single view of all the jobs across the servers and whether they've run successfully, etc.
Airflow is one of the tools we are considering to achieve this, but our servers are configured very differently. Is it possible to set up Airflow so that DAG1 (along with the Airflow scheduler & webserver) runs on server1 and DAG2 runs on server2, without RabbitMQ?
Essentially I'd like to achieve something like the first answer given here (or just at a DAG level): Airflow DAG tasks parallelism on different worker nodes
in the quickest & simplest way possible!
Thanks
You can check out Running Apache-Airflow with Celery Executor in Docker.
To use Celery, you can instantiate a Redis node as a pod and proceed with managing tasks across multiple hosts.
The link above also gives you a starter docker-compose YAML to help you get started quickly with Apache Airflow on the Celery executor.
Is it possible to set up Airflow so that DAG1 (and the Airflow scheduler & webserver) runs on server1 and DAG2 runs on server2 without RabbitMQ?
By default, Airflow with the Celery executor spreads work across multiple hosts, and the division is always at the task level, not the DAG level.
This post might help you with spawning specific tasks on a specific worker node.
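One way to approximate per-server placement without RabbitMQ is the Celery executor with Redis as the broker plus one queue per server: a task (or all of a DAG's tasks) pinned to a queue only runs on workers subscribed to it. A sketch, with made-up DAG and queue names:

    # DAG whose tasks should only run on server1's worker.
    from datetime import datetime
    from airflow import DAG
    from airflow.operators.bash import BashOperator

    with DAG("dag1", start_date=datetime(2022, 1, 1),
             schedule_interval="@daily", catchup=False) as dag:
        BashOperator(
            task_id="job_on_server1",
            bash_command="echo 'running on server1'",
            queue="server1",  # only workers listening on this queue will execute it
        )

Then start one worker on each server, bound to that server's queue (e.g. airflow celery worker --queues server1 on server1); the scheduler and webserver can live on server1 alone.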
Recently I updated my airflow.cfg to enable metrics through StatsD.
I'm injecting this configuration:
AIRFLOW__SCHEDULER__STATSD_ON=True
AIRFLOW__SCHEDULER__STATSD_HOST=HOSTNAME
AIRFLOW__SCHEDULER__STATSD_PORT=9125
AIRFLOW__SCHEDULER__STATSD_PREFIX=airflow
I'm not using the standard StatsD service but statsd-exporter, which speaks the StatsD protocol, so to my knowledge I can point Airflow directly at statsd-exporter. By default it listens on port 9125.
After statsd-exporter receives the metrics, Prometheus can scrape them in the usual manner.
All fine, all good. Moreover, I wrote a mapping file for statsd-exporter using a bit of regex, but my issue is that when I open the web UI of statsd-exporter (port 9102), I see only part of the Airflow metrics, not all of them!
The documentation lists the available metrics here.
For instance, I see that Airflow sends ti_failures, ti_successes, dagbag_size, etc. But there are no metrics at all like dag.<dag_id>.<task_id>.duration or executor.open_slots, and a couple of others.
A really big thank you to anyone who has played with StatsD and Airflow, as I have no clue. :(
I recently instrumented Airflow to export its metrics from StatsD to Prometheus.
In my architecture, Airflow runs in Kubernetes pods, specifically:
scheduler
worker
flower
web
Only the scheduler, worker, and web pods have a sidecar container to export StatsD metrics (let's call these the metrics pods).
The list of metrics you see in the official docs (https://airflow.apache.org/metrics.html) is not available from every metrics pod.
To address your specific problem: dag.<dag_id>.<task_id>.duration is exported by the worker pods.
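If you want to verify which metrics a given pod actually emits (as opposed to which ones your mapping file drops), one option is a throwaway UDP listener run in place of statsd-exporter; a sketch, assuming the port 9125 configured above:

    # Dump raw StatsD packets so you can see exactly what Airflow sends.
    import socket

    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("0.0.0.0", 9125))  # the STATSD_PORT from the question
    while True:
        data, _ = sock.recvfrom(65535)
        print(data.decode("utf-8", errors="replace"), flush=True)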
I'm currently using Airflow on Amazon Web Services with EC2 instances. The big issue is that the average usage of the instances is about 2%...
I'd like to use a scalable architecture, creating instances only for the duration of a job and then killing them. I saw on the roadmap that AWS Batch was supposed to become an executor in 2017, but no news about that.
Do you know if it is possible to use AWS Batch as an executor for all Airflow jobs?
Regards,
Romain.
There is no executor, but an operator is available from version 1.10. After you create a Compute Environment, Job Queue, and Job Definition in AWS Batch, you can use the AWSBatchOperator to trigger jobs.
Here is the source code.
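A hedged sketch of what that looks like in an Airflow 1.10 DAG; the queue and job definition names are placeholders for resources you would have created in AWS Batch first:

    from datetime import datetime
    from airflow import DAG
    from airflow.contrib.operators.awsbatch_operator import AWSBatchOperator

    with DAG("batch_example", start_date=datetime(2018, 1, 1),
             schedule_interval=None) as dag:
        AWSBatchOperator(
            task_id="submit_batch_job",
            job_name="my-batch-job",             # name shown in the Batch console
            job_definition="my-job-definition",  # created beforehand in AWS Batch
            job_queue="my-job-queue",            # created beforehand in AWS Batch
            overrides={},                        # optional container overrides
            region_name="eu-west-1",
        )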
Currently there are a SequentialExecutor, a LocalExecutor, a DaskExecutor, a CeleryExecutor, and a MesosExecutor. I heard they're working on AIRFLOW-1899, targeted for 2.0, to introduce a KubernetesExecutor. Looking at Dask and Celery, it doesn't seem they support a mode where workers are created per task. Mesos might, and Kubernetes should, but then you'd have to scale the worker clusters accordingly to turn off the nodes when they're not needed.
We did a little work to get a CloudFormation setup where Celery workers scale out and in based on CloudWatch metrics of the average CPU load across the tagged workers.
You would need to create a custom executor (extending BaseExecutor) capable of submitting and monitoring AWS Batch jobs. You may also need to create a custom Docker image for the instances.
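A very rough skeleton of such an executor, under the assumption that the task command runs inside an Airflow-capable image registered as a Batch job definition (all names are placeholders, and error handling is omitted):

    import boto3
    from airflow.executors.base_executor import BaseExecutor

    class AwsBatchExecutor(BaseExecutor):
        """Sketch only: submit each task as an AWS Batch job and poll its state."""

        def start(self):
            self.batch = boto3.client("batch")
            self.active_jobs = {}  # task instance key -> Batch job id

        def execute_async(self, key, command, queue=None, executor_config=None):
            # `command` is the `airflow tasks run ...` invocation for this task.
            response = self.batch.submit_job(
                jobName="airflow-task",
                jobQueue="my-job-queue",           # placeholder
                jobDefinition="my-airflow-image",  # placeholder
                containerOverrides={"command": list(command)},
            )
            self.active_jobs[key] = response["jobId"]

        def sync(self):
            # Called periodically by the scheduler's heartbeat loop.
            for key, job_id in list(self.active_jobs.items()):
                job = self.batch.describe_jobs(jobs=[job_id])["jobs"][0]
                if job["status"] == "SUCCEEDED":
                    self.success(key)
                    del self.active_jobs[key]
                elif job["status"] == "FAILED":
                    self.fail(key)
                    del self.active_jobs[key]

        def end(self):
            while self.active_jobs:
                self.sync()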
I found this repository, which in my case works quite well: https://github.com/aelzeiny/airflow-aws-executors I'm using Batch jobs with the FARGATE_SPOT compute engine.
I'm still struggling with logging to AWS CloudWatch and the return status in AWS Batch, but from the Airflow perspective it's working.