I am just trying to figure out if there is a way to limit the duration of a DAG run in Airflow. For example, set the maximum time a DAG can run for to 30 minutes.
DAGs do have a dagrun_timeout parameter, but it only takes effect once max_active_runs for the DAG is reached (16 by default). For example, if you have 15 active runs, the scheduler will just launch a 16th; but before launching the next one, it will wait until one of the previous runs finishes or exceeds the timeout.
But you can use execution_timeout on task instances. That parameter works unconditionally.
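For example, a minimal sketch (the DAG and task names are hypothetical, and Airflow 2.x style imports are assumed) that sets both limits:

    from datetime import datetime, timedelta
    from airflow import DAG
    from airflow.operators.bash import BashOperator

    with DAG(
        dag_id="limited_dag",                         # hypothetical name
        start_date=datetime(2021, 1, 1),
        schedule_interval="@hourly",
        dagrun_timeout=timedelta(minutes=30),         # only enforced once max_active_runs is reached
        catchup=False,
    ) as dag:
        BashOperator(
            task_id="long_running_task",              # hypothetical name
            bash_command="sleep 3600",
            execution_timeout=timedelta(minutes=30),  # kills this task instance after 30 minutes, unconditionally
        )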
I have the following Airflow setup:
Executor: KubernetesExecutor
Airflow version: 2.1.3
Airflow config: parallelism = 256
I have the below scenario.
I have a number of DAGs (e.g. 10) which depend on the success state of a task from another DAG. The tasks kept failing, with retries enabled for 6 attempts.
All the dependent DAGs run hourly, and as a result they were put into the queued state by the scheduler. I could see around 800 DAG runs in the queue and nothing running, so I ended up manually changing their state to failed.
Below are my questions from this event.
Is there a limit on the number of DAGs that can run concurrently in an Airflow setup?
Is there a limit on how many DAG runs can be queued?
When DAG runs are queued, how does the scheduler decide which one to pick? Is it based on queued time?
Is it possible to set up a priority among the queued DAGs?
How does Airflow 2.1.3 treat tasks in the queue? Are they counted against the max_active_runs parameter?
I have an Airflow dag-1 that runs for approximately a week and a dag-2 that runs every day for a few hours. While dag-1 is running I cannot have dag-2 running, due to an API rate limit (also, dag-2 is supposed to run once dag-1 is finished).
Suppose dag-1 is already running and dag-2, which is supposed to run every day, fails as a result. Is there a way I can schedule these DAG dependencies the right way?
Is it possible to stop dag-1 temporarily (while it is running) when dag-2 is supposed to start, and then run dag-1 again, without manual intervention?
One of the best ways is to use a defined pool.
Let's say you have a pool named "specific_pool" with only one slot allocated to it.
Specify that pool name on the bash tasks in both DAGs (use the newly created pool instead of the default pool). That way you avoid running both DAGs in parallel.
This means that whenever dag-1 is running, dag-2 will never be triggered until the pool slot is free; and if dag-2 has taken the pool slot, dag-1 will not get triggered until dag-2 is completed.
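A rough sketch of what that looks like for dag-2 (the dag_id and task are hypothetical; it assumes the "specific_pool" pool with a single slot has already been created, e.g. via Admin -> Pools in the UI or the CLI command "airflow pools set specific_pool 1 'serialize dag-1 and dag-2'"):

    from datetime import datetime
    from airflow import DAG
    from airflow.operators.bash import BashOperator

    with DAG(
        dag_id="dag_2",                       # hypothetical name
        start_date=datetime(2021, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        BashOperator(
            task_id="call_rate_limited_api",  # hypothetical name
            bash_command="echo 'rate-limited API call'",
            pool="specific_pool",             # shares the single slot with dag-1's tasks
        )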
I'm using Airflow to automate some machine learning models. Everything runs successfully, but I have an issue with the order of the tasks.
I have 7 tasks running in parallel, and the last two tasks must start only when those 7 tasks finish.
When 6 of the tasks finish, the last two start without waiting for the 7th task to finish.
Here's the image of what's happening.
It appears that the trigger_rule of your creation_order_cell_task is incorrect (for the desired behaviour).
To get the behaviour you want, it should be either ALL_SUCCESS (the default) or ALL_DONE.
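As a minimal sketch of the wiring (the dag_id and the upstream task ids are hypothetical; EmptyOperator assumes Airflow 2.3+, on older versions DummyOperator plays the same role):

    from datetime import datetime
    from airflow import DAG
    from airflow.operators.empty import EmptyOperator
    from airflow.utils.trigger_rule import TriggerRule

    with DAG(dag_id="ml_models", start_date=datetime(2022, 1, 1), schedule_interval=None) as dag:
        # the 7 tasks that run in parallel
        parallel_tasks = [EmptyOperator(task_id=f"model_{i}") for i in range(7)]
        creation_order_cell_task = EmptyOperator(
            task_id="creation_order_cell_task",
            trigger_rule=TriggerRule.ALL_SUCCESS,  # default: waits for all 7 upstream tasks to succeed
        )
        parallel_tasks >> creation_order_cell_task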
I recently upgraded from v1.7.1.2 to v1.9.0 and after the upgrade I noticed that the CPU usage increased significantly. After doing some digging, I tracked it down to these two scheduler config options: min_file_process_interval (defaults to 0) and max_threads (defaults to 2).
As expected, increasing min_file_process_interval avoids the tight loop and drops CPU usage when the scheduler goes idle. But what I don't understand is why min_file_process_interval affects task execution.
If I set min_file_process_interval to 60s, it now waits no less than 60s between executing each task in my DAG, so if my dag has 4 sequential tasks it has now added 4 minutes to my execution time. For example:
start -60s-> [task1] -60s-> [task2] -60s-> [task3] -60s-> [task4]
I have Airflow setup in my test env and prod env. This is less of an issue in my prod env (although still concerning), but a big issue for my test env. After the upgrade the CPU usage is significantly higher so either I accept higher CPU usage or try to decrease it with a higher config value. However, this adds significant time to my test dags execution time.
Why does min_file_process_interval affect time between tasks after the DAG has been scheduled? Are there other config options that could solve my issue?
Another option you might want to look into is SCHEDULER_HEARTBEAT_SEC.
This setting is usually also set to a very tight interval but can be loosened up a bit. This setting, in combination with MAX_THREADS, did the trick for us. The dev machines are fast enough for re-deployment but without a hot, glowing CPU, which is good.
The most likely cause is that there are too many Python files in the dags folder, so the Airflow scheduler scans and re-instantiates the DAGs too often.
It is recommended to first reduce the number of DAG files seen by the scheduler and workers. At the same time, set the SCHEDULER_HEARTBEAT_SEC and MAX_THREADS values as large as possible.
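For reference, these knobs live in the [scheduler] section of airflow.cfg; a hedged sketch with purely illustrative values (tune them to your own hardware and latency needs) might look like:

    [scheduler]
    # Seconds the scheduler sleeps between loops; larger values lower idle CPU
    # at the cost of scheduling latency.
    scheduler_heartbeat_sec = 10
    # Minimum seconds between re-parses of the same DAG file.
    min_file_process_interval = 30
    # Number of scheduler processes used to parse and schedule DAG files.
    max_threads = 2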
The LocalExecutor spawns new processes while scheduling tasks. Is there a limit to the number of processes it creates? I needed to change it, so I need to know the difference between the scheduler's "max_threads" and "parallelism" in airflow.cfg.
parallelism: not a very descriptive name. The description says it sets the maximum task instances for the airflow installation, which is a bit ambiguous — if I have two hosts running airflow workers, I'd have airflow installed on two hosts, so that should be two installations, but based on context 'per installation' here means 'per Airflow state database'. I'd name this max_active_tasks.
dag_concurrency: Despite the name, based on the comment this is actually the task concurrency, and it's per worker. I'd name this max_active_tasks_for_worker (per_worker would suggest that it's a global setting for workers, but I think you can have workers with different values set for this).
max_active_runs_per_dag: This one's kinda alright, but since it seems to be just a default value for the matching DAG kwarg, it might be nice to reflect that in the name, something like default_max_active_runs_for_dags
So let's move on to the DAG kwargs:
concurrency: Again, having a general name like this, coupled with the fact that concurrency is used for something different elsewhere makes this pretty confusing. I'd call this max_active_tasks.
max_active_runs: This one sounds alright to me.
source: https://issues.apache.org/jira/browse/AIRFLOW-57
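To make the DAG-level kwargs above concrete, a small sketch (the dag_id and task are hypothetical, Airflow 2.x style imports are assumed, and note that the concurrency kwarg was later renamed max_active_tasks in newer Airflow releases):

    from datetime import datetime
    from airflow import DAG
    from airflow.operators.bash import BashOperator

    with DAG(
        dag_id="example_limits",   # hypothetical name
        start_date=datetime(2019, 1, 1),
        schedule_interval="@daily",
        concurrency=16,            # max task instances running at once for this DAG
        max_active_runs=1,         # max simultaneous runs of this DAG
    ) as dag:
        BashOperator(task_id="work", bash_command="echo hello")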
max_threads gives the user some control over cpu usage. It specifies scheduler parallelism.
It's 2019 and more updated docs have come out. In short:
AIRFLOW__CORE__PARALLELISM is the max number of task instances that can run concurrently across ALL of Airflow (all tasks across all dags)
AIRFLOW__CORE__DAG_CONCURRENCY is the max number of task instances allowed to run concurrently FOR A SINGLE SPECIFIC DAG
These docs describe it in more detail:
According to https://www.astronomer.io/guides/airflow-scaling-workers/:
parallelism is the max number of task instances that can run concurrently on airflow. This means that across all running DAGs, no more than 32 tasks will run at one time.
And
dag_concurrency is the number of task instances allowed to run concurrently within a specific dag. In other words, you could have 2 DAGs running 16 tasks each in parallel, but a single DAG with 50 tasks would also only run 16 tasks - not 32.
And, according to https://airflow.apache.org/faq.html#how-to-reduce-airflow-dag-scheduling-latency-in-production:
max_threads: Scheduler will spawn multiple threads in parallel to schedule dags. This is controlled by max_threads with default value of 2. User should increase this value to a larger value (e.g. number of cpus where scheduler runs - 1) in production.
But it seems like this last piece shouldn't take up too much time, because it's just the "scheduling" portion. Not the actual running portion. Therefore we didn't see the need to tweak max_threads much, but AIRFLOW__CORE__PARALLELISM and AIRFLOW__CORE__DAG_CONCURRENCY did affect us.
The scheduler's max_threads is the number of processes to parallelize the scheduler over. The max_threads cannot exceed the cpu count. The LocalExecutor's parallelism is the number of concurrent tasks the LocalExecutor should run. Both the scheduler and the LocalExecutor use python's multiprocessing library for parallelism.
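Pulling those together, a hedged airflow.cfg sketch with the default values the quotes above assume (values are illustrative, not recommendations):

    [core]
    # Max running task instances across the whole Airflow instance (all DAGs).
    parallelism = 32
    # Max running task instances within a single DAG.
    dag_concurrency = 16

    [scheduler]
    # Number of scheduler processes used to parse and schedule DAG files;
    # keep at or below the CPU count of the scheduler host.
    max_threads = 2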