Can anyone guide me on how to improve the performance of my AWS managed Airflow? I have tested my Airflow DAGs with the following scenario.
Scenario:
I have two DAG files
DAG 1 has only one task
DAG 2 has six tasks; one of the six calls a third-party API (the API response time is 900 milliseconds; it is a simple weather API that returns the current weather for a given city, e.g. https://api.weatherapi.com/v1/current.json?key={api_key}&q=Ahmedabad) and the other five tasks just write logs
I trigger DAG 1 with a custom payload containing 100 records
DAG 1's single task loops through the records and triggers DAG 2 100 times, once per record (sketched below)
My concern is that it takes around 6 minutes for DAG 2 to process all 100 executions
When I test the same code on a local Airflow installation, all the runs complete within 1 minute
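For reference, here is a minimal sketch of the fan-out pattern described above (Airflow 2.2-style imports; the DAG ids, payload shape, and run_id scheme are my own assumptions, since the original code is not shown):

from datetime import datetime

from airflow import DAG
from airflow.api.common.experimental.trigger_dag import trigger_dag
from airflow.operators.python import PythonOperator

def fan_out(**context):
    # Assumed payload shape: {"records": [{...}, {...}, ...]} passed via dag_run.conf
    records = (context["dag_run"].conf or {}).get("records", [])
    for i, record in enumerate(records):
        # One DAG 2 run per record; run_id must be unique for each triggered run
        trigger_dag(
            dag_id="dag_2",
            run_id=f"{context['run_id']}_record_{i}",
            conf=record,
        )

with DAG(
    dag_id="dag_1",
    start_date=datetime(2022, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    PythonOperator(task_id="trigger_dag_2_per_record", python_callable=fan_out)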
I have used the following Airflow configuration in AWS, and I set the same configuration in the local Airflow airflow.cfg file.
Airflow Configuration (Airflow 2.2.2):
Maximum worker count: 12
core.dag_concurrency: 64
core.parallelism: 128
Can anyone guide me on how to improve my AWS Airflow performance, i.e. the parallelism of DAG runs?
Related
I have the following airflow setup
Executor : KubernetesExecutor
airflow version : 2.1.3
airflow config : parallelism = 256
I have the below scenario
I have a number of DAGs (e.g. 10) that depend on the success state of a task from another DAG. The dependent tasks kept failing, with retries enabled for 6 attempts.
All the dependent DAGs run hourly, and as a result their runs were put into the queued state by the scheduler. I could see around 800 DAG runs in the queue and nothing was running, so I ended up manually changing their state to failed.
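For context, here is a minimal sketch of the kind of cross-DAG dependency described above, assuming it is implemented with ExternalTaskSensor (the DAG/task ids and timings are made up):

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.dummy import DummyOperator
from airflow.sensors.external_task import ExternalTaskSensor

with DAG(
    dag_id="dependent_dag",
    start_date=datetime(2022, 1, 1),
    schedule_interval="@hourly",
    catchup=False,
) as dag:
    wait_for_upstream = ExternalTaskSensor(
        task_id="wait_for_upstream_task",
        external_dag_id="upstream_dag",
        external_task_id="upstream_task",
        retries=6,                          # matches the 6 retries mentioned above
        retry_delay=timedelta(minutes=5),
        timeout=600,                        # fail each sensor attempt after 10 minutes
        mode="reschedule",                  # release the worker slot while waiting
    )
    do_work = DummyOperator(task_id="do_work")
    wait_for_upstream >> do_work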
Below are my questions from this event.
Is there a limit on the number of DAGs that can run concurrently in an Airflow setup?
Is there a limit on how many DAG runs can be queued?
When DAG runs are queued, how does the scheduler decide which one to pick? Is it based on queued time?
Is it possible to set up priority among the queued DAGs?
How does Airflow 2.1.3 treat tasks in the queue? Are they counted against the max_active_runs parameter?
Airflow Version 2.0.2
I have three schedulers running in a Kubernetes cluster using the CeleryExecutor with a Postgres backend. Everything seems to run fine for a couple of weeks, but then the Airflow scheduler stops scheduling some tasks. I've done an airflow db reset followed by an airflow db init and a fresh deployment of the Airflow-specific images. Below are some of the errors I've received from the database logs:
LOG: could not receive data from client: Connection timed out
STATEMENT: SELECT slot_pool.pool AS slot_pool_pool, slot_pool.slots AS slot_pool_slots
FROM slot_pool FOR UPDATE NOWAIT
According to https://github.com/apache/airflow/issues/19811 the slot_pool issue is expected behavior, but I cannot figure out why DAGs suddenly stop being scheduled on time. For reference, there are ~500 DAGs being run every 15 minutes.
The slot_pool table looks like this:
select * from slot_pool;
id | pool | slots | description
----+--------------+-------+--------------
1 | default_pool | 128 | Default pool
(1 row)
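As a side note, the same pool data can also be read through Airflow's ORM instead of raw SQL. A small sketch, assuming Airflow 2.x and a reachable metadata database:

from airflow.models import Pool
from airflow.utils.session import create_session

# Print every pool and its slot count straight from the metadata DB
with create_session() as session:
    for pool in session.query(Pool).all():
        print(pool.pool, pool.slots, pool.description)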
I have looked at several posts, but none of the posts seem to explain the issue or provide a solution. Below are a few of them:
Airflow initdb slot_pool does not exists
Running multiple Airflow Schedulers cause Postgres locking issues
Airflow tasks get stuck at "queued" status and never gets running
I am just trying to figure out if there is a way to limit the duration of a DAG run in Airflow. For example, set the maximum time a DAG can run for to 30 minutes.
DAGs do have a dagrun_timeout parameter, but it only takes effect once max_active_runs for the DAG is reached (16 by default). For example, if you have 15 active DAG runs, the scheduler will simply launch the 16th. But before launching the next one, the scheduler will wait until one of the previous runs finishes or exceeds the timeout.
You can, however, use execution_timeout on task instances. This parameter works unconditionally.
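A minimal sketch showing both knobs side by side (Airflow 2.x-style imports; the DAG and task names are arbitrary):

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="timeout_demo",
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    dagrun_timeout=timedelta(minutes=30),  # per the note above, only enforced once max_active_runs is reached
) as dag:
    BashOperator(
        task_id="long_task",
        bash_command="sleep 3600",
        execution_timeout=timedelta(minutes=30),  # fails the task instance unconditionally after 30 minutes
    )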
I am able to configure the airflow.cfg file to run tasks one after the other.
What I want to do is execute tasks in parallel, e.g. 2 at a time, until I reach the end of the list.
How can I configure this?
Executing tasks in Airflow in parallel depends on which executor you're using, e.g., SequentialExecutor, LocalExecutor, CeleryExecutor, etc.
For a simple setup, you can achieve parallelism by just setting your executor to LocalExecutor in your airflow.cfg:
[core]
executor = LocalExecutor
Reference: https://github.com/apache/incubator-airflow/blob/29ae02a070132543ac92706d74d9a5dc676053d9/airflow/config_templates/default_airflow.cfg#L76
This will spin up a separate process for each task.
(Of course you'll need to have a DAG with at least 2 tasks that can execute in parallel to see it work.)
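A toy example of such a DAG (Airflow 2.x-style imports; in older versions BashOperator lives in airflow.operators.bash_operator): task_a and task_b have no dependency between each other, so the executor is free to run them at the same time.

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="parallel_demo",
    start_date=datetime(2022, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    start = BashOperator(task_id="start", bash_command="echo start")
    # No dependency between these two tasks, so they can run concurrently.
    task_a = BashOperator(task_id="task_a", bash_command="sleep 10")
    task_b = BashOperator(task_id="task_b", bash_command="sleep 10")
    start >> [task_a, task_b]

With LocalExecutor and the default parallelism settings, both sleep tasks should show up as running at the same time in the UI.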
Alternatively, with CeleryExecutor, you can spin up any number of workers by just running (as many times as you want):
$ airflow worker
The tasks will go into a Celery queue and each Celery worker will pull off of the queue.
You might find the section Scaling out with Celery in the Airflow Configuration docs helpful.
https://airflow.apache.org/howto/executor/use-celery.html
For any executor, you may want to tweak the core settings that control parallelism once you have that running.
They're all found under [core]. These are the defaults:
# The amount of parallelism as a setting to the executor. This defines
# the max number of task instances that should run simultaneously
# on this airflow installation
parallelism = 32
# The number of task instances allowed to run concurrently by the scheduler
dag_concurrency = 16
# Are DAGs paused by default at creation
dags_are_paused_at_creation = True
# When not using pools, tasks are run in the "default pool",
# whose size is guided by this config element
non_pooled_task_slot_count = 128
# The maximum number of active DAG runs per DAG
max_active_runs_per_dag = 16
Reference: https://github.com/apache/incubator-airflow/blob/29ae02a070132543ac92706d74d9a5dc676053d9/airflow/config_templates/default_airflow.cfg#L99
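If you only need to cap parallelism for a single DAG rather than the whole installation, most of these limits have per-DAG counterparts as well. A sketch, assuming Airflow 1.10/2.x argument names (concurrency was later renamed max_active_tasks):

from datetime import datetime

from airflow import DAG

# Per-DAG counterparts of the global settings above (values are examples only).
dag = DAG(
    dag_id="limited_dag",
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    concurrency=2,      # at most 2 task instances of this DAG run at once
    max_active_runs=1,  # at most 1 active run of this DAG at a time
)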
We are facing an issue in Airflow where it is executing the same query multiple times.
We checked that only one instance of the DAG was running at that time.
In the web UI, we checked the task instance that ran, but it shows logs for only one task.
There are 4 worker nodes in total, and after checking the worker node logs we found out that:
Worker Node 1: DML task executed by this worker node at 01.00
Worker Node 2: No DML executed, only "Queuing attempt 1" at 01.00
Worker Node 3: No task ran on this node at 01.00
Worker Node 4: The same DML task executed by this worker node at 01.00
Worker Node 1 and Worker Node 4 submitted the same DML to the server at the same time.
Is there any way to prevent Airflow from submitting the same query multiple times from different worker nodes?