Airflow sends metrics via StatsD, but not all of them

Recently I updated my airflow.cfg to enable metrics through StatsD. I'm injecting this configuration:
AIRFLOW__SCHEDULER__STATSD_ON=True
AIRFLOW__SCHEDULER__STATSD_HOST=HOSTNAME
AIRFLOW__SCHEDULER__STATSD_PORT=9125
AIRFLOW__SCHEDULER__STATSD_PREFIX=airflow
I'm not using the standard StatsD service but statsd_exporter, which speaks the StatsD protocol, so as far as I know I can point Airflow directly at statsd_exporter. By default it listens on port 9125.
After statsd_exporter receives the metrics, Prometheus can scrape them in the regular manner.
All fine, all good. I also wrote a mapping file for statsd_exporter, using a bit of regex, but... my issue is that when I open the web UI of statsd_exporter (port 9102), I see some of the Airflow metrics, but not all of them!
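For reference, a minimal statsd_exporter mapping file for Airflow's task-duration metric might look like the sketch below (the Prometheus metric name and labels are my own choices, not the original mapping file):

```yaml
mappings:
  # airflow.dag.<dag_id>.<task_id>.duration -> airflow_task_duration{dag_id=..., task_id=...}
  - match: "airflow.dag.*.*.duration"
    name: "airflow_task_duration"
    labels:
      dag_id: "$1"
      task_id: "$2"
```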
The documentation lists the available metrics here.
For instance, I can see that Airflow sends ti_failures, ti_successes, dagbag_size, etc. But there are no metrics at all like dag...duration or executor.open_slots, and a couple of others.
A really big thank you to anyone who has ever played with StatsD and Airflow, as I have no clue :(

I recently instrumented Airflow to export its metrics from StatsD to Prometheus.
In my architecture, Airflow runs in Kubernetes pods, specifically:
scheduler
worker
flower
web
Only the scheduler, worker and web pods have a side-car container that exports StatsD metrics (let's call these the metrics pods).
The list of metrics you see in the official docs (https://airflow.apache.org/metrics.html) is not available from every metrics pod.
To address your specific problem: the dag...duration metrics are exported by the worker node.

Related

Airflow scheduler failure notification via slack

I am a newbie to Airflow, so pardon any stupid assumptions I make about it. I have an ETL set up at work where I run Airflow on a company cluster, with a DAG containing a few tasks. It is quite possible that the cluster Airflow runs on crashes; in that event the DAG will not run.
I wanted to check whether we can set up a notification on failure of the Airflow scheduler itself. My online reading has turned up several useful articles on monitoring the DAG, but if the scheduler fails then those failure notifications won't be triggered (correct me if that's not how it works).
Open the link below in incognito if you hit a paywall and don't have a subscription:
https://medium.com/datareply/integrating-slack-alerts-in-airflow-c9dcd155105
You have to use an external software for this, like Datadog.
Here you can find more information:
https://docs.datadoghq.com/integrations/airflow/?tab=host
Basically, you connect Datadog to Airflow externally through StatsD.
In my case, I have Airflow deployed through docker-compose, and Datadog is another container (from the official Datadog Docker image), linked to the scheduler and webserver containers.
You can also use Grafana and Prometheus (also through StatsD), which is the open-source way: https://databand.ai/blog/everyday-data-engineering-monitoring-airflow-with-prometheus-statsd-and-grafana/
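A lighter-weight alternative, if you don't want a full monitoring stack: the Airflow webserver exposes a /health endpoint that reports the scheduler's heartbeat status, so a small external script (run from cron on a different machine than the scheduler) can post to a Slack incoming webhook when the scheduler goes unhealthy. A sketch, assuming the JSON shape of the /health payload and a Slack webhook URL of your own:

```python
import json
import urllib.request


def scheduler_unhealthy(health_payload: dict) -> bool:
    """Return True if the /health payload reports an unhealthy (or missing) scheduler."""
    status = health_payload.get("scheduler", {}).get("status")
    return status != "healthy"


def alert_if_down(airflow_url: str, slack_webhook: str) -> None:
    """Fetch Airflow's /health endpoint and post to Slack if the scheduler is down."""
    with urllib.request.urlopen(f"{airflow_url}/health") as resp:
        payload = json.load(resp)
    if scheduler_unhealthy(payload):
        body = json.dumps({"text": "Airflow scheduler is unhealthy!"}).encode()
        req = urllib.request.Request(
            slack_webhook, data=body, headers={"Content-Type": "application/json"}
        )
        urllib.request.urlopen(req)
```

You would call something like `alert_if_down("http://airflow-host:8080", "https://hooks.slack.com/services/...")` from cron; both URLs are placeholders for your own deployment.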

Airflow versus AWS Step Functions for workflow

I am working on a project that grabs a set of input data from AWS S3, pre-processes and divvies it up, spins up 10K batch containers to process the divvied data in parallel on AWS Batch, post-aggregates the data, and pushes it to S3.
I already have software patterns from other projects for Airflow + Batch, but have not dealt with the scaling factor of 10K parallel tasks. Airflow is nice since I can look at which tasks failed and retry a task after debugging. But dealing with that many tasks on one Airflow EC2 instance seems like a barrier. The other option would be to have one task that kicks off the 10K containers and monitors it from there.
I have no experience with Step Functions, but have heard it's AWS's Airflow. There looks to be plenty of patterns online for Step Functions + Batch. Does Step Functions seem like a good path to check out for my use case? Do you get the same insights on failing jobs / ability to retry tasks as you do with Airflow?
I have worked on both Apache Airflow and AWS Step Functions and here are some insights:
Step Functions provides out-of-the-box maintenance. It has the high availability and scalability required for your use case; with Airflow we would have to handle that ourselves with auto-scaling/load balancing on servers or containers (Kubernetes).*
Both Airflow and Step Functions have user-friendly UIs. While Airflow supports multiple representations of the state machine, Step Functions only displays the state machine as a DAG.
As of version 2.0, Airflow's REST API is now stable. AWS Step Functions is also supported by a range of production-grade CLIs and SDKs.
Airflow has server costs, while Step Functions offers 4,000 free state transitions per month (free tier) and costs $0.000025 per transition after that. E.g., if you run 10K steps for AWS Batch once daily, you will be charged $0.25 per day ($7.50 per month). The price of an Airflow server (t2.large EC2, 1-year reserved instance) is $41.98 per month. We would have to use AWS Batch in either case.**
AWS Batch can integrate to both Airflow and Step Functions.
You can clear and rerun a failed task in Apache Airflow, but in Step Functions you will have to create a custom implementation to handle that. You can also configure automated retries with back-offs in the Step Functions definition.
For a failed task in Step Functions you get a visual representation of the failed state, plus a detailed message when you click it. You can also use the AWS CLI or SDK to get the details.
Step Functions uses easy-to-write JSON as the state machine definition, while Airflow uses Python scripts.
Step Functions supports async callbacks, i.e. the state machine pauses until an external source notifies it to resume, while Airflow has yet to add this feature.
Overall, I see more advantages of using AWS Step Functions. You will have to consider maintenance cost and development cost for both services as per your use case.
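To illustrate two of the points above (JSON state machine definitions, and automated retries with back-offs), a minimal Step Functions state that submits a Batch job might look roughly like this (job names and the truncated ARNs are placeholders):

```json
{
  "StartAt": "ProcessChunk",
  "States": {
    "ProcessChunk": {
      "Type": "Task",
      "Resource": "arn:aws:states:::batch:submitJob.sync",
      "Parameters": {
        "JobName": "process-chunk",
        "JobDefinition": "arn:aws:batch:...:job-definition/process-chunk",
        "JobQueue": "arn:aws:batch:...:job-queue/default"
      },
      "Retry": [
        {
          "ErrorEquals": ["States.TaskFailed"],
          "IntervalSeconds": 30,
          "BackoffRate": 2.0,
          "MaxAttempts": 3
        }
      ],
      "End": true
    }
  }
}
```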
UPDATES (AWS Managed Workflows for Apache Airflow Service):
*With the AWS Managed Workflows for Apache Airflow (MWAA) service, you can offload deployment, maintenance, autoscaling/load balancing and security of your Airflow service to AWS. But please consider the version number you're willing to settle for, as AWS managed services are usually behind the latest version. (E.g., as of March 08, 2021, the latest version of open-source Airflow is 2.0.1, while MWAA only offers version 1.10.12.)
**MWAA charges for the environment, instances and storage. More details here.
I have used both Airflow and Step Functions in my personal and work projects.
In general I liked Step Functions, but the fact that you need to schedule executions with EventBridge is super annoying. Actually, I think Airflow could simply act as a trigger for the Step Functions here.
If Airflow were cheaper to manage, I would always opt for it, because I find managing JSON-based pipelines a hassle whenever I need to detour from the main use case, which somehow always happens to me. This becomes an even more complex issue when you need source control.
This one is a more subjective assessment, but I find the monitoring capabilities of Airflow far greater than those of Step Functions.
Also, some information about the usage of Airflow vs Step Functions:
AWS currently offers managed Airflow, which is priced per hour, so you don't need a dedicated EC2 instance. On the other hand, Step Functions is typically driven by AWS Lambdas, which have an execution time limit of 15 minutes, making them not the best candidate for long-running pipelines.

spring cloud dataflow and airflow

We have Airflow as a workflow management tool to schedule/monitor tasks, and we also have some applications using Spring Cloud Data Flow (SCDF) for loose coupling across processes, with producers and consumers talking over the Kafka message bus and Grafana dashboards for the UI (ETL). Kubernetes and AWS (EKS) are options for deployment.
We are starting to create data pipelines which will have sources (files on S3, servers, or databases), processors (custom applications, AI/ML pipelines) and destinations (Kafka, S3, databases, ES). I am planning to use Airflow for the overall management of the pipelines, with the tasks within a pipeline handled by SCDF-based applications, or by future applications written in Python as the AI/ML piece expands. Is this the correct approach, or can I let go of one in favor of the other?
Based on your requirements, SCDF would fit and provides options to manage your streaming data pipelines.
While you can still research other possible approaches, I can give some more hints on what SCDF provides to meet some of your requirements.
SCDF provides out-of-the-box apps which you can extend/customize. These apps include an S3 source and sink which you can use as-is. For the complete list of out-of-the-box apps, you can refer to the page here.
SCDF also has a Kubernetes deployer, which works on any Kubernetes-based platform. You can configure your K8s-specific properties as a set of Kubernetes deployer properties when you deploy the applications.
You can embed a Python-based application as a processor/transformer in the streaming data pipeline. You can check this recipe on the SCDF site to learn more.
You can also embed a TensorFlow application as a processor application inside a pipeline.
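For a sense of what such a pipeline looks like, a stream built from out-of-the-box apps can be defined in the SCDF shell with the stream DSL, roughly like this (the stream name is illustrative, and real use would need app options such as the S3 bucket and credentials):

```
dataflow:>stream create --name s3-etl --definition "s3 | transform | log" --deploy
```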

AWS Batch executor with Airflow

I'm currently using Airflow on Amazon Web Services with EC2 instances. The big issue is that the average usage of the instances is about 2%...
I'd like to use a scalable architecture, creating instances only for the duration of a job and then killing them. I saw on the roadmap that AWS Batch was supposed to become an executor in 2017, but there has been no news of that.
Do you know if it is possible to use AWS Batch as an executor for all Airflow jobs?
Regards,
Romain.
There is no executor, but an operator is available from version 1.10. After you create a Compute Environment, Job Queue and Job Definition on AWS Batch, you can use the AWSBatchOperator to trigger jobs.
Here is the source code.
Currently there are a SequentialExecutor, a LocalExecutor, a DaskExecutor, a CeleryExecutor and a MesosExecutor. I heard they're working on AIRFLOW-1899, targeted for 2.0, to introduce a KubernetesExecutor. Looking at Dask and Celery, it doesn't seem they support a mode where their workers are created per task. Mesos might, Kubernetes should, but then you'd have to scale the worker clusters accordingly to account for turning off the nodes when they're not needed.
We did a little work to get a CloudFormation setup where Celery workers scale out and in based on CloudWatch metrics of the average CPU load across the tagged workers.
You would need to create a custom executor (extending BaseExecutor) capable of submitting and monitoring the AWS Batch jobs. You may also need to create a custom Docker image for the instances.
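The monitoring half of such a custom executor largely comes down to polling Batch and translating job statuses into Airflow task states. The status names below are AWS Batch's documented job statuses; how they map onto task states is my own assumption of how one might implement it:

```python
# AWS Batch job statuses (from the Batch API) mapped to Airflow-style task states.
# Terminal statuses resolve the task; anything else means "keep polling".
BATCH_TERMINAL = {
    "SUCCEEDED": "success",
    "FAILED": "failed",
}
BATCH_IN_FLIGHT = {"SUBMITTED", "PENDING", "RUNNABLE", "STARTING", "RUNNING"}


def task_state_for(batch_status: str):
    """Translate an AWS Batch job status into an Airflow task state.

    Returns None while the job is still in flight (the executor should poll again).
    """
    if batch_status in BATCH_TERMINAL:
        return BATCH_TERMINAL[batch_status]
    if batch_status in BATCH_IN_FLIGHT:
        return None
    raise ValueError(f"Unknown AWS Batch status: {batch_status}")
```

The actual executor would call this from its sync loop after a `describe_jobs` request against the Batch API, then report the terminal state back to Airflow.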
I found this repository, which in my case works quite well: https://github.com/aelzeiny/airflow-aws-executors. I'm using Batch jobs with the FARGATE_SPOT compute engine.
I'm still struggling with the logging on AWS CloudWatch and the return status in AWS Batch, but from the Airflow perspective it's working.

How do I setup an Airflow of 2 servers?

Trying to split the Airflow processes across 2 servers. Server A, which has been running in standalone mode with everything on it, has the DAGs, and I'd like to set it as the worker in the new setup with an additional server.
Server B is the new server, which would host the metadata database on MySQL.
Can I have Server A run the LocalExecutor, or would I have to use the CeleryExecutor? The scheduler would have to run on the server that has the DAGs, right? Or does it have to run on every server in the cluster? I'm confused as to what dependencies there are between the processes.
This article does an excellent job demonstrating how to cluster Airflow onto multiple servers.
Multi-Node (Cluster) Airflow Setup
A more formal setup for Apache Airflow is to distribute the daemons across multiple machines as a cluster.
Benefits
Higher Availability
If one of the worker nodes were to go down or be purposely taken offline, the cluster would still be operational and tasks would still be executed.
Distributed Processing
If you have a workflow with several memory-intensive tasks, the tasks will be better distributed, allowing for higher utilization of data across the cluster and faster execution of the tasks.
Scaling Workers
Horizontally
You can scale the cluster horizontally and distribute the processing by adding more executor nodes to the cluster and allowing those new nodes to take load off the existing nodes. Since workers don’t need to register with any central authority to start processing tasks, the machine can be turned on and off without any downtime to the cluster.
Vertically
You can scale the cluster vertically by increasing the number of celeryd daemons running on each node. This can be done by increasing the value in the ‘celeryd_concurrency’ config in the {AIRFLOW_HOME}/airflow.cfg file.
Example:
celeryd_concurrency = 30
You may need to increase the size of the instances in order to support a larger number of celeryd processes. This will depend on the memory and cpu intensity of the tasks you’re running on the cluster.
Scaling Master Nodes
You can also add more Master Nodes to your cluster to scale out the services running on them. This mainly allows you to scale out the Web Server daemon in case there are too many HTTP requests for one machine to handle, or if you want to provide higher availability for that service.
One thing to note is that there can only be one Scheduler instance running at a time. If you have multiple Schedulers running, there is a possibility that multiple instances of a single task will be scheduled. This could cause some major problems with your Workflow and cause duplicate data to show up in the final table if you were running some sort of ETL process.
If you would like, the Scheduler daemon may also be setup to run on its own dedicated Master Node.
Apache Airflow Cluster Setup Steps
Pre-Requisites
The following nodes are available with the given host names:
master1 - Will have the role(s): Web Server, Scheduler
master2 - Will have the role(s): Web Server
worker1 - Will have the role(s): Worker
worker2 - Will have the role(s): Worker
A queuing service is running (RabbitMQ, AWS SQS, etc.).
You can install RabbitMQ by following these instructions: Installing RabbitMQ
If you’re using RabbitMQ, it is recommended that it also be set up as a cluster for high availability. Set up a load balancer to proxy requests to the RabbitMQ instances.
Additional Documentation
Documentation: https://airflow.incubator.apache.org/
Install Documentation: https://airflow.incubator.apache.org/installation.html
GitHub Repo: https://github.com/apache/incubator-airflow
All airflow processes need to have the same contents in their airflow_home folder. This includes configuration and dags. If you only want server B to run your MySQL database, you do not need to worry about any airflow specifics. Simply install the database on server B and change your airflow.cfg's sql_alchemy_conn parameter to point to your database on Server B and run airflow initdb from Server A.
If you also want to run airflow processes on server B, you would have to look into scaling using the CeleryExecutor.
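Concretely, the change on Server A is just the connection string in airflow.cfg (the hostname, user, password and database name below are placeholders for your own setup):

```ini
[core]
# Point the metadata DB at MySQL on Server B instead of the local default.
sql_alchemy_conn = mysql://airflow_user:airflow_pass@server-b:3306/airflow
```

After that, run `airflow initdb` once from Server A to create the schema on Server B.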