I have a number of DAGs that wait for EOD settlement, and a limited number of worker slots. The settlement ends at varying times, so while waiting for it I want to run a different DAG on the worker slot. From the Airflow documentation, a deferrable operator looks like a fit for this kind of purpose. I'm new to Python and Airflow. Can somebody explain how to write a deferrable SQL sensor?
I have looked at deferrable time sensor examples, but can't get them to work with SQL sensors.
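For anyone new to the idea, here is a minimal sketch of what deferring buys you, written with plain asyncio and sqlite3 rather than Airflow's own trigger classes. It only models the behaviour: the polling coroutine gives up control between checks so other work can run. The names wait_for_row and mark_settled_later, the settlement table, and the query are all hypothetical.

```python
import asyncio
import sqlite3


async def wait_for_row(conn, sql, poll_interval=0.01, max_polls=100):
    """Poll `sql` until its first cell is truthy, yielding between polls."""
    for _ in range(max_polls):
        row = conn.execute(sql).fetchone()
        if row and row[0]:
            return row[0]
        # Awaiting here hands control back to the event loop, so other
        # coroutines can run instead of this one holding a slot.
        await asyncio.sleep(poll_interval)
    raise TimeoutError("condition not met within max_polls")


async def mark_settled_later(conn, delay=0.03):
    """Stand-in for the external settlement process finishing later."""
    await asyncio.sleep(delay)
    conn.execute("INSERT INTO settlement (status) VALUES ('done')")


async def main():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE settlement (status TEXT)")
    sql = "SELECT COUNT(*) FROM settlement WHERE status = 'done'"
    count, _ = await asyncio.gather(
        wait_for_row(conn, sql),
        mark_settled_later(conn),
    )
    return count


result = asyncio.run(main())
print(result)
```

Note that sqlite3 calls block the event loop, so a real implementation would likely use an async database driver or push the query to a thread.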
I currently have a PythonSensor which waits for files on an ftp server. Is it possible to have this sensor trigger a task on timeout? I am trying to create the following dag:
[Image: airflow sensor diagram]
I have taken a look at BranchPythonOperator but it seems like I no longer get the benefits of rescheduling a task if it fails the first time.
Have you tried to use trigger_rule="all_failed" in your task?
all_failed: All upstream tasks are in a failed or upstream_failed state
See http://airflow.apache.org/docs/apache-airflow/stable/concepts/dags.html?highlight=all_failed#trigger-rules
And an example here http://airflow.apache.org/docs/apache-airflow/stable/faq.html?highlight=all_failed#how-to-trigger-tasks-based-on-another-task-s-failure
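To make the rule concrete, here is a small sketch of the decision that all_failed describes. It is a simplified model, not Airflow's implementation; the function name all_failed_fires is hypothetical, and requiring at least one upstream task is my own assumption.

```python
# States that count as "failed" for the all_failed rule, per the definition above.
FAILED_STATES = {"failed", "upstream_failed"}


def all_failed_fires(upstream_states):
    """Return True when every upstream task is failed or upstream_failed."""
    states = list(upstream_states)
    # Assumption of this sketch: a task with no upstream tasks never fires.
    return bool(states) and all(state in FAILED_STATES for state in states)


# A sensor that timed out ends up "failed", so the handler task would run.
print(all_failed_fires(["failed"]))
# A sensor that found the file ends up "success", so the handler would not run.
print(all_failed_fires(["success"]))
```

In the DAG itself this corresponds to passing trigger_rule="all_failed" to the task that should run after the sensor times out.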
I am trying to diagnose an under-performing airflow pipeline and am wondering what kind of performance I should expect out of the airflow scheduler in terms similar to "tasks scheduled per second".
I have few queued jobs and many of my tasks finish in seconds so I suspect the scheduler is the limiting component and it is my fault for having many quick tasks. Still, I would rather not rewrite my DAGs if it can be avoided.
What can I do to increase the rate at which the scheduler queues tasks?
Pipeline Details
Here is what my current airflow.cfg looks like.
I only have two dags running. One is scheduled every 5 min and the other is rarely triggered by the first. I am currently trying to backfill several years at this frequency, but may need to change my approach:
As for worker nodes: I currently have 4 fairly powerful servers running at less than 10% resource usage in disk, network, cpu, RAM, swap. Toggling 3 of the workers off has no impact on my task throughput and the server left on barely even registers the change in workload.
There are a number of config values in your airflow.cfg that could be related to this.
Under [core]:
parallelism: Total number of task instances that can run at once.
dag_concurrency: Limit of task instances that can run at once per DAG; you may need to bump it if you have many parallel tasks. Can be overridden when defining a DAG.
non_pooled_task_slot_count: Limit of tasks without a pool configured that can run at once.
max_active_runs_per_dag: The maximum number of active DAG runs per DAG. You may need to bump it if you're triggering runs manually or there's a backlog of DAG runs scheduled with a short interval. Can be overridden when defining a DAG.
Under [scheduler]:
scheduler_heartbeat_sec: Defines how often the scheduler runs; try it out with lower values.
min_file_process_interval: Process each file at most once every N seconds. Set to 0 to never limit how often you process a file.
Under [celery]:
celeryd_concurrency: Number of worker processes Celery will run with, so essentially the number of task instances a worker can take at once. Matching the number of CPUs is a popular starting point, but you can definitely go higher.
Last one is only if you're using the CeleryExecutor, which I'd definitely recommend if you're looking to increase your task throughput.
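A quick way to see which of the [core] limits is the tightest is to read them out of the config and compare. The sketch below uses only the standard library; the sample values are illustrative, not recommendations, and the helper names read_limits and tightest_limit are hypothetical.

```python
import configparser

# Illustrative values only; substitute the contents of your own airflow.cfg.
SAMPLE_CFG = """
[core]
parallelism = 32
dag_concurrency = 16
non_pooled_task_slot_count = 128
"""


def read_limits(cfg_text):
    """Read the task-count limits discussed above from airflow.cfg text."""
    # Interpolation is disabled so literal % signs elsewhere in the file are harmless.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(cfg_text)
    keys = ("parallelism", "dag_concurrency", "non_pooled_task_slot_count")
    return {key: parser.getint("core", key) for key in keys}


def tightest_limit(limits):
    """Return the (name, value) of the smallest limit, i.e. the likely bottleneck."""
    name = min(limits, key=limits.get)
    return name, limits[name]


limits = read_limits(SAMPLE_CFG)
print(tightest_limit(limits))
```

With these sample values the smallest limit is dag_concurrency, so that is the first one to raise if a single DAG has many tasks that could run in parallel.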
I want to use Airflow to implement data flows that periodically poll external systems (ftp servers, etc), check for new files matching certain conditions, and then run a bunch of tasks for those files. Now, I'm a newbie to Airflow and read that Sensors are something you would use for this kind of a case, and I actually managed to write a sensor that works ok when I run "airflow test" for it. But I'm a bit confused regarding the relation of poke_interval for the sensor and the DAG scheduling. How should I define those settings for my use case? Or should I use some other approach? I just want Airflow to run the tasks when those files become available, and not flood the dashboard with failures when no new files were available for a while.
Your understanding is correct, using a sensor is the way to go when you want to poll, either by using an existing sensor or by implementing your own.
They are, however, always part of a DAG and they do not execute outside of its boundaries. DAG execution depends on the start_date and schedule_interval, but you can combine this with a sensor to make a DAG depend on the status of an external server. One possible approach is to start the whole DAG with a sensor that checks for a condition and skips the whole DAG if the condition is not met (you can make sure that sensors mark downstream tasks as skipped and not failed by setting their soft_fail parameter to True). You can have a polling interval of one minute by using the most frequent scheduling option (* * * * *). If you really need a shorter polling time, you can tweak the sensor's poke_interval and timeout parameters.
Keep in mind, however, that execution times are probably not guaranteed by Airflow itself, so for very short polling times you may want to investigate alternatives (or at least consider different approaches to the one I've just shared).
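How poke_interval, timeout and soft_fail interact can be sketched with a simplified model of a sensor loop. This is not Airflow's code: run_sensor is a hypothetical function, time is simulated instead of slept, and the exact boundary at which the timeout fires is an assumption of the sketch.

```python
def run_sensor(poke, poke_interval, timeout, soft_fail=False):
    """Simplified sensor loop; poke_interval must be positive.

    Returns "success", "skipped" or "failed".
    """
    elapsed = 0  # simulated seconds since the sensor started
    while not poke():
        if elapsed > timeout:
            # With soft_fail the sensor is skipped instead of failed.
            return "skipped" if soft_fail else "failed"
        elapsed += poke_interval  # stands in for sleeping poke_interval seconds
    return "success"


# The file shows up on the third check.
answers = iter([False, False, True])
print(run_sensor(lambda: next(answers), poke_interval=60, timeout=600))

# The file never shows up; soft_fail turns the timeout into a skip.
print(run_sensor(lambda: False, poke_interval=60, timeout=180, soft_fail=True))
```

The DAG schedule decides when a new sensor instance starts, while poke_interval and timeout only control how that one instance behaves until it succeeds or gives up.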
The LocalExecutor spawns new processes while scheduling tasks. Is there a limit to the number of processes it creates? I need to change it. What is the difference between the scheduler's "max_threads" and "parallelism" in airflow.cfg?
parallelism: not a very descriptive name. The description says it sets the maximum task instances for the airflow installation, which is a bit ambiguous — if I have two hosts running airflow workers, I'd have airflow installed on two hosts, so that should be two installations, but based on context 'per installation' here means 'per Airflow state database'. I'd name this max_active_tasks.
dag_concurrency: Despite the name, based on the comment this is actually the task concurrency, and it's per worker. I'd name this max_active_tasks_for_worker (per_worker would suggest that it's a global setting for workers, but I think you can have workers with different values set for this).
max_active_runs_per_dag: This one's kinda alright, but since it seems to be just a default value for the matching DAG kwarg, it might be nice to reflect that in the name, something like default_max_active_runs_for_dags
So let's move on to the DAG kwargs:
concurrency: Again, having a general name like this, coupled with the fact that concurrency is used for something different elsewhere makes this pretty confusing. I'd call this max_active_tasks.
max_active_runs: This one sounds alright to me.
source: https://issues.apache.org/jira/browse/AIRFLOW-57
max_threads gives the user some control over cpu usage. It specifies scheduler parallelism.
It's 2019 and more updated docs have come out. In short:
AIRFLOW__CORE__PARALLELISM is the max number of task instances that can run concurrently across ALL of Airflow (all tasks across all dags)
AIRFLOW__CORE__DAG_CONCURRENCY is the max number of task instances allowed to run concurrently FOR A SINGLE SPECIFIC DAG
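Those variable names follow Airflow's AIRFLOW__{SECTION}__{KEY} pattern for overriding airflow.cfg entries from the environment. A minimal sketch of building such names; the helper airflow_env_var and the override values are hypothetical.

```python
def airflow_env_var(section, key):
    """Build the environment variable name that overrides [section] key."""
    return f"AIRFLOW__{section.upper()}__{key.upper()}"


# Illustrative overrides; the values are examples, not recommendations.
overrides = {
    airflow_env_var("core", "parallelism"): "64",
    airflow_env_var("core", "dag_concurrency"): "32",
}

for name, value in overrides.items():
    print(f"{name}={value}")
```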
These docs describe it in more detail:
According to https://www.astronomer.io/guides/airflow-scaling-workers/:
parallelism is the max number of task instances that can run
concurrently on airflow. This means that across all running DAGs, no
more than 32 tasks will run at one time.
And
dag_concurrency is the number of task instances allowed to run
concurrently within a specific dag. In other words, you could have 2
DAGs running 16 tasks each in parallel, but a single DAG with 50 tasks
would also only run 16 tasks - not 32
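The arithmetic in those two quotes can be written out as a small sketch. It models only these two limits and ignores pools, queues and worker capacity; running_tasks is a hypothetical helper, and the defaults of 32 and 16 are the numbers from the quotes.

```python
def running_tasks(ready_tasks_per_dag, parallelism=32, dag_concurrency=16):
    """Upper bound on tasks running at once, given ready tasks in each DAG."""
    # Each DAG is capped individually by dag_concurrency...
    per_dag = [min(ready, dag_concurrency) for ready in ready_tasks_per_dag]
    # ...and the total across all DAGs is capped by parallelism.
    return min(sum(per_dag), parallelism)


print(running_tasks([16, 16]))      # two DAGs with 16 ready tasks each
print(running_tasks([50]))          # one DAG with 50 ready tasks
print(running_tasks([50, 50, 50]))  # three busy DAGs hit the global cap
```

The three calls give 32, 16 and 32: a single DAG never gets past dag_concurrency, and no combination of DAGs gets past parallelism.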
And, according to https://airflow.apache.org/faq.html#how-to-reduce-airflow-dag-scheduling-latency-in-production:
max_threads: Scheduler will spawn multiple threads in parallel to
schedule dags. This is controlled by max_threads with default value of
2. User should increase this value to a larger value(e.g numbers of cpus where scheduler runs - 1) in production.
But it seems like this last piece shouldn't take up too much time, because it's just the "scheduling" portion, not the actual running portion. Therefore we didn't see the need to tweak max_threads much, but AIRFLOW__CORE__PARALLELISM and AIRFLOW__CORE__DAG_CONCURRENCY did affect us.
The scheduler's max_threads is the number of processes to parallelize the scheduler over; it cannot exceed the CPU count. The LocalExecutor's parallelism is the number of concurrent tasks the LocalExecutor should run. Both the scheduler and the LocalExecutor use Python's multiprocessing library for parallelism.
I have an application that runs as a web service, which submits jobs to Spark on a user request. A job queue needs to be limited per user. I am planning to use Airflow as an orchestration framework to manage job queues but while it supports parallel DAG execution it's optimized for batch processing rather than real time. Is Airflow designed to handle ~200 DAG executions per second with multiple queues (one per user) or should I look for alternatives?
Do you have data moving from one task to another? Does time matter here, since you mentioned real time? With Airflow, workflows are expected to be mostly static or slowly changing, and it is mostly used for ETL batch processing. You can speed up the Airflow heartbeat, but it would be good to build a POC with your use case to test it out.
Below is from the official Airflow documentation: https://airflow.apache.org/#beyond-the-horizon
Airflow is not a data streaming solution. Tasks do not move data from
one to the other (though tasks can exchange metadata!). Airflow is not
in the Spark Streaming or Storm space, it is more comparable to Oozie
or Azkaban