I have a problem with long waits between tasks in the same DAG. I'm running Airflow 1.10. Can anyone point out which settings are relevant to tweak? Please see the example below.
The longest wait here is more than an hour and a half.
The only thing that has worked for me is restarting the scheduler, so I now have a cron job that restarts the scheduler every 5 minutes.
Can anyone guide me on how to improve my AWS managed Airflow performance? I have tested the Airflow DAGs with the following scenario.
Scenario:
I have two DAG files
DAG 1 has only one task
DAG 2 has six tasks; one of the six calls a third-party API (the API response time is 900 milliseconds; it is a simple weather API that returns the current weather for a given city, e.g. https://api.weatherapi.com/v1/current.json?key={api_key}&q=Ahmedabad) and the other five tasks just write logs
I trigger DAG 1 with a custom payload containing 100 records
The DAG 1 task just loops through the records and triggers DAG 2 once per record, 100 times in total (see the sketch below)
My concern is that it takes around 6 minutes for DAG 2 to complete all 100 runs,
whereas when I test the same code on a local Airflow installation, all the DAG runs complete within 1 minute.
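For reference, here is a minimal sketch of what the fan-out task in DAG 1 looks like (the DAG ids, task ids, and payload shape are simplified, and the trigger_dag import path may differ between Airflow versions):

```python
# Simplified sketch of DAG 1: loop over the payload records and trigger DAG 2 once per record.
# Assumes the 100 records arrive via `airflow dags trigger dag_1 --conf '{"records": [...]}'`.
from datetime import datetime

from airflow import DAG
from airflow.api.common.experimental.trigger_dag import trigger_dag
from airflow.operators.python import PythonOperator


def fan_out(**context):
    records = (context["dag_run"].conf or {}).get("records", [])
    for i, record in enumerate(records):
        # Each triggered DAG 2 run needs a unique run_id and receives one record in its conf.
        trigger_dag(
            dag_id="dag_2",
            run_id=f"record_{context['ds_nodash']}_{i}",
            conf={"record": record},
        )


with DAG(
    dag_id="dag_1",
    start_date=datetime(2022, 1, 1),
    schedule_interval=None,  # triggered manually with the payload
    catchup=False,
) as dag:
    PythonOperator(task_id="fan_out_to_dag_2", python_callable=fan_out)
```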
I used the following Airflow configuration in AWS, and I set the same configuration in the local Airflow airflow.cfg file.
Airflow configuration (Airflow 2.2.2):
Maximum worker count: 12
core.dag_concurrency: 64
core.parallelism: 128
Can anyone guide me on how to improve my AWS Airflow performance and increase the parallelism of DAG runs?
I have an Airflow dag-1 that runs for approximately a week and a dag-2 that runs every day for a few hours. While dag-1 is running, I cannot have dag-2 running because of an API rate limit (dag-2 is also supposed to run once dag-1 is finished).
Suppose dag-1 is already running; then dag-2, which is supposed to run every day, fails. Is there a way I can schedule the DAG dependencies correctly?
Is it possible to pause dag-1 temporarily (while it is running) when dag-2 is supposed to start, and then resume dag-1 without manual intervention?
One of the best ways is to use a defined pool.
Let's say you have a pool named "specific_pool" and allocate only one slot to it.
Specify that pool name on the tasks in your DAGs (for example on your BashOperator, using the newly created pool instead of the default pool). That way you can avoid running both DAGs in parallel.
This means that whenever dag-1 is running, dag-2 will never be triggered until the pool slot is free, and likewise, if dag-2 has taken the slot, dag-1 will not run until dag-2 has completed.
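A minimal sketch of the idea, assuming a pool named "specific_pool" with a single slot has already been created (via Admin -> Pools in the UI, or `airflow pools set specific_pool 1 "serialize dag-1 and dag-2"` on the Airflow 2 CLI); the DAG ids, schedules, and commands below are placeholders:

```python
# Illustrative sketch: both DAGs put their work in the same one-slot pool,
# so whichever DAG grabs the slot first blocks the other until it finishes.
# Assumes a pool named "specific_pool" with 1 slot already exists (Airflow 2 import paths).
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="dag_1",
    start_date=datetime(2022, 1, 1),
    schedule_interval="@weekly",
    catchup=False,
) as dag_1:
    BashOperator(
        task_id="week_long_job",
        bash_command="echo 'week-long work goes here'",  # placeholder for the real command
        pool="specific_pool",                            # serialize against dag-2
    )

with DAG(
    dag_id="dag_2",
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag_2:
    BashOperator(
        task_id="daily_job",
        bash_command="echo 'daily work goes here'",      # placeholder for the real command
        pool="specific_pool",                            # same pool, same single slot
    )
```

Note that the pool only serializes the tasks that are assigned to it, so any dag-2 tasks left in the default pool would still run while dag-1 holds the slot.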
I'm using Airflow to automate some machine learning models. Everything runs successfully, but I have issues with the order of the tasks.
I have 7 tasks running in parallel, and the last two tasks must start only when those 7 tasks finish.
As soon as 6 of the tasks finish, the last two start without waiting for the 7th task to finish.
Here's an image of what's happening.
It appears that the trigger_rule of your creation_order_cell_task is incorrect (for the desired behaviour).
To get the behaviour you want, it should be either ALL_SUCCESS (the default) or ALL_DONE.
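A minimal sketch of the fix, with placeholder tasks standing in for your seven parallel ones (only creation_order_cell_task comes from your DAG; everything else here is illustrative, and the import paths assume Airflow 2):

```python
# Illustrative sketch: the two downstream tasks only start once all seven
# upstream tasks have succeeded, because their trigger_rule is ALL_SUCCESS.
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy import DummyOperator
from airflow.utils.trigger_rule import TriggerRule

with DAG(dag_id="ml_models", start_date=datetime(2022, 1, 1), schedule_interval=None) as dag:
    parallel_tasks = [DummyOperator(task_id=f"model_task_{i}") for i in range(1, 8)]

    creation_order_cell_task = DummyOperator(
        task_id="creation_order_cell_task",
        trigger_rule=TriggerRule.ALL_SUCCESS,  # the default: wait for every upstream task to succeed
    )
    other_final_task = DummyOperator(
        task_id="other_final_task",
        trigger_rule=TriggerRule.ALL_SUCCESS,  # or TriggerRule.ALL_DONE if failures should not block it
    )

    parallel_tasks >> creation_order_cell_task
    parallel_tasks >> other_final_task
```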
I am trying to diagnose an under-performing Airflow pipeline and am wondering what kind of performance I should expect out of the Airflow scheduler, in terms similar to "tasks scheduled per second".
I have few queued jobs and many of my tasks finish in seconds, so I suspect the scheduler is the limiting component and that it is my fault for having many quick tasks. Still, I would rather not rewrite my DAGs if it can be avoided.
What can I do to increase the rate at which the scheduler queues tasks?
Pipeline Details
Here is what my current airflow.cfg looks like.
I only have two DAGs running. One is scheduled every 5 minutes and the other is rarely triggered by the first. I am currently trying to backfill several years at this frequency, but may need to change my approach.
As for worker nodes: I currently have 4 fairly powerful servers running at less than 10% resource usage across disk, network, CPU, RAM, and swap. Toggling 3 of the workers off has no impact on my task throughput, and the server left on barely even registers the change in workload.
There are a number of config values in your airflow.cfg that could be related to this.
Under [core]:
parallelism: Total number of task instances that can run at once.
dag_concurrency: Limit of task instances that can run concurrently per DAG; you may need to bump this if you have many parallel tasks. Can be overridden when defining a DAG (see the sketch below).
non_pooled_task_slot_count: Limit of tasks without a pool configured that can run at once.
max_active_runs_per_dag: The maximum number of active DAG runs per DAG. Relevant if you're triggering runs manually or there's a backlog of DAG runs scheduled with a short interval. Can be overridden when defining a DAG.
Under [scheduler]:
scheduler_heartbeat_sec: Defines how often the scheduler runs; try it out with lower values.
min_file_process_interval: Process each file at most once every N seconds. Set to 0 to never limit how often you process a file.
Under [celery]:
celeryd_concurrency: Number of worker processes Celery will run with, so essentially the number of task instances a worker can take at once. Matching the number of CPUs is a popular starting point, but you can definitely go higher.
The last one only applies if you're using the CeleryExecutor, which I'd definitely recommend if you're looking to increase your task throughput.
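For the two settings that can be overridden per DAG, here is a quick sketch of what that looks like (the dag_id, schedule, and values are arbitrary; the keyword names assume an Airflow 1.10-era installation):

```python
# Illustrative per-DAG overrides for the [core] settings mentioned above.
from datetime import datetime

from airflow import DAG

dag = DAG(
    dag_id="high_throughput_dag",
    start_date=datetime(2019, 1, 1),
    schedule_interval="*/5 * * * *",
    concurrency=32,      # per-DAG override of core.dag_concurrency
    max_active_runs=8,   # per-DAG override of core.max_active_runs_per_dag
)
```

The [core], [scheduler], and [celery] keys themselves live in airflow.cfg, or can be set with the matching AIRFLOW__SECTION__KEY environment variables (e.g. AIRFLOW__CORE__PARALLELISM).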
We have multiple jobs configured in Oozie; some jobs are signal-free and some are dependent on signals. We gave the signal-free jobs a start time of 2 AM, and the jobs used to start firing at that time. For the last month we have noticed that those signal-free jobs have been delayed by 1 hour. We are not sure why this is happening.
Does anyone have any idea why the Oozie jobs have started executing with a delay?