Airflow submits the same query multiple times

We are facing an issue in Airflow where it executes the same query multiple times.
We checked that only one instance of the DAG was running at that time.
In the web UI we checked the task instance that ran, but it shows only one set of task logs.
There are 4 worker nodes in total, and after checking the worker node logs we found:
Worker Node 1: DML task executed by this worker node at 01.00
Worker Node 2: No DML executed, only "Queuing attempt 1" at 01.00
Worker Node 3: No task run on this node at 01.00
Worker Node 4: Same DML task executed by this worker node at 01.00
Worker Nodes 1 and 4 submitted the same DML to the server at the same time.
Is there any way to prevent Airflow from submitting the same query multiple times from different worker nodes?

Related

Airflow Scheduler handling queueing of dags

I have the following Airflow setup:
Executor: KubernetesExecutor
Airflow version: 2.1.3
Airflow config: parallelism = 256
I have the below scenario:
I have a number of DAGs (e.g. 10) that depend on the success state of a task from another DAG. That task kept failing, with retries enabled for 6 attempts.
All the dependent DAGs run hourly, and as a result they were put into the queued state by the scheduler. I could see around 800 DAG runs queued and nothing running, so I ended up manually changing their state to failed.
Below are my questions from this event:
Is there a limit on the number of DAGs that can run concurrently in an Airflow setup?
Is there a limit on how many DAG runs can be queued?
When DAG runs are queued, how does the scheduler decide which one to pick? Is it based on queued time?
Is it possible to set up priority among the queued DAG runs?
How does Airflow 2.1.3 treat tasks in the queue? Are they counted against the max_active_runs parameter?
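For reference, the limits these questions touch live partly in airflow.cfg (core.parallelism for total running task instances, core.dag_concurrency for the per-DAG default in 2.1.x) and partly on the DAG itself. The snippet below is only a rough sketch of the DAG-level knobs; the dag_id and numbers are placeholders, not recommendations.

# Sketch only: DAG-level limits in Airflow 2.1.x syntax. The dag_id and values
# are placeholders; cluster-wide limits are set separately in airflow.cfg.
from datetime import datetime
from airflow import DAG
from airflow.operators.dummy import DummyOperator

with DAG(
    dag_id="hourly_dependent_dag",   # hypothetical name
    start_date=datetime(2021, 1, 1),
    schedule_interval="@hourly",
    catchup=False,
    max_active_runs=3,               # cap on concurrent runs of this DAG
    concurrency=16,                  # cap on running task instances of this DAG
) as dag:
    DummyOperator(task_id="wait_for_upstream")  # placeholder for the dependent task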

MWAA Airflow DAG takes more time to execute

Can anyone guide me on how to improve my AWS-managed Airflow (MWAA) performance? I have tested the Airflow DAGs with the following scenario.
Scenario:
I have two DAG files.
DAG 1 has only one task.
DAG 2 has six tasks; of those six, one task calls a third-party API (the API response time is 900 milliseconds; it is a simple weather API that returns the current weather for a provided city, e.g. https://api.weatherapi.com/v1/current.json?key={api_key}&q=Ahmedabad) and the other 5 tasks just write logs.
I trigger DAG 1 with a custom payload containing 100 records.
The DAG 1 task just loops through the records and triggers DAG 2 100 times, once per record (a hedged sketch of this fan-out follows below).
My concern is that it takes around 6 minutes for DAG 2 to process all 100 executions,
whereas when I test the same code on a local Airflow installation it completes the DAG runs within 1 minute.
I have used the following Airflow configuration in AWS, and I set the same configuration in the local airflow.cfg file.
Airflow Configuration (Airflow 2.2.2):
Maximum worker count: 12
core.dag_concurrency: 64
core.parallelism: 128
Can anyone guide me on how to improve my AWS Airflow performance to improve the parallelism of the DAG runs?
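The question does not show the DAG code, so the sketch below is only one possible way to express the described fan-out, not necessarily how the asker implemented it: each record gets its own TriggerDagRunOperator so the scheduler can queue the 100 triggers independently. The dag_ids, the payload shape, and the record contents are all assumptions.

# Sketch only: a plausible shape for the DAG 1 fan-out described above.
# "dag_2" and the payload structure are assumptions; TriggerDagRunOperator
# fires one dag_2 run per record and passes the individual record via conf.
from datetime import datetime
from airflow import DAG
from airflow.operators.trigger_dagrun import TriggerDagRunOperator

with DAG(
    dag_id="dag_1_fan_out",                # hypothetical name
    start_date=datetime(2022, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    records = [{"city": f"city_{i}"} for i in range(100)]  # stands in for the 100-record payload
    for i, record in enumerate(records):
        TriggerDagRunOperator(
            task_id=f"trigger_dag_2_{i}",
            trigger_dag_id="dag_2",        # hypothetical downstream DAG id
            conf=record,                   # individual record handed to DAG 2
        )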

Limiting concurrency for a single task across DAG instances

I have a DAG (a >> b >> c >> d). This DAG can have up to 100 instances running at a time. It is fine for tasks a, b, and d to run concurrently; however, I would only like one dag_run to run task c at a time. How do I do this? Thanks!
You could try using Pools.
Pools are a classic way of limiting task execution in Airflow. You can assign individual tasks to a specific pool and control how many TaskInstances of that task are running concurrently. In your case, you could create a pool with a single slot, assign task C to this pool, and Airflow should only have one instance of that task running at any given time.
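A minimal sketch of that approach, assuming a pool named single_slot_pool that already exists with one slot (created in the UI or with: airflow pools set single_slot_pool 1 "serialize task c"); the dag_id and operators are placeholders.

# Sketch only: limit task c to one concurrent run across all DAG runs by
# assigning it to a pool that has a single slot.
from datetime import datetime
from airflow import DAG
from airflow.operators.dummy import DummyOperator

with DAG(
    dag_id="a_b_c_d_dag",                  # hypothetical name
    start_date=datetime(2022, 1, 1),
    schedule_interval=None,
    catchup=False,
    max_active_runs=100,                   # up to 100 runs may be active at once
) as dag:
    a = DummyOperator(task_id="a")
    b = DummyOperator(task_id="b")
    c = DummyOperator(task_id="c", pool="single_slot_pool")  # only one c runs at a time
    d = DummyOperator(task_id="d")
    a >> b >> c >> d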

Is there a way to have 3 sets of worker nodes (groups) for Airflow

We are setting up Airflow for scheduling/orchestration. Currently we have Spark Python loads and non-Spark loads on different servers, and we push files to GCP from yet another server. Is there an option to decide which worker nodes Airflow tasks are submitted to? Currently we are using SSH connections to run all workloads. Our processing is mostly on-prem.
We use the Celery executor model. How do we make sure that a specific task runs on its appropriate node?
Task 1 runs on a non-Spark server (no Spark binaries available).
Task 2 executes a PySpark submit (this server has the Spark binaries).
Task 3 pushes the files created by task 2 from another server/node (only this server has the GCP utilities installed to push the files, for security reasons).
If I create a DAG, is it possible to specify which set of worker nodes a task executes on?
Currently we have a wrapper shell script for each task and make 3 SSH runs to complete the process. We would like to avoid such wrapper shell scripts and instead use the built-in PythonOperator, SparkSubmitOperator, SparkJdbcOperator, and SFTPToGCSOperator, while making sure each specific task runs on a specific server or set of worker nodes.
In short, can we have 3 worker node groups and make each task execute on a group of nodes based on the operation?
We can assign a queue to each worker node.
Start each Airflow worker specifying the queue it should listen on:
airflow worker -q sparkload
airflow worker -q non-sparkload
airflow worker -q gcpload
Then define each task with the matching queue (a sketch follows below). A similar thread was found as well:
How can Airflow be used to run distinct tasks of one workflow in separate machines?
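As a minimal sketch of routing each task to the worker group that listens on its queue, assuming the Spark and Google provider packages are installed; the dag_id, callables, application path, and bucket are placeholders, and the queue names match the worker commands above.

# Sketch only: the queue attribute on each operator routes it to the worker
# group started with the corresponding "airflow worker -q <queue>" command.
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator
from airflow.providers.google.cloud.transfers.sftp_to_gcs import SFTPToGCSOperator

with DAG(
    dag_id="three_worker_groups",          # hypothetical name
    start_date=datetime(2022, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    prepare = PythonOperator(
        task_id="prepare",
        python_callable=lambda: print("non-spark work"),  # placeholder workload
        queue="non-sparkload",             # runs only on workers started with -q non-sparkload
    )
    spark_job = SparkSubmitOperator(
        task_id="spark_job",
        application="/path/to/job.py",     # placeholder application path
        queue="sparkload",                 # runs only on the Spark worker group
    )
    push_to_gcs = SFTPToGCSOperator(
        task_id="push_to_gcs",
        source_path="/data/output/*",      # placeholder source path
        destination_bucket="my-bucket",    # placeholder bucket
        queue="gcpload",                   # runs only on the GCP-enabled workers
    )
    prepare >> spark_job >> push_to_gcs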

Running a particular task on airflow master node

I have a DAG with a list of tasks that run using the Celery executor on different worker nodes. However, I would like to run one of the tasks on the master node. Is that possible?
Yes, it is possible. You can route specific tasks to specific queues in Celery. The Airflow documentation covers it quite nicely, but the gist of it is:
Set a queue attribute on the operator representing the task you want to run on a specific node, using a value different from the celery -> default_queue value in airflow.cfg.
Run the worker process on your master node, specifying the queue it needs to listen on: airflow worker -q queue_name. If you want your worker to listen to multiple queues, you can use a comma-delimited list: airflow worker -q default_queue,queue_name.
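A compact sketch of those two steps, assuming a hypothetical queue name master_only and placeholder BashOperator tasks; the worker on the master node would be started as shown in the comment.

# Sketch only: pin one task to the master node. "master_only" is a hypothetical
# queue name; start a worker on the master node with:
#   airflow worker -q default_queue,master_only
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(dag_id="master_pinned_task", start_date=datetime(2022, 1, 1),
         schedule_interval=None, catchup=False) as dag:
    on_any_worker = BashOperator(task_id="on_any_worker", bash_command="echo regular")
    on_master = BashOperator(
        task_id="on_master",
        bash_command="echo runs on master",
        queue="master_only",   # differs from celery -> default_queue in airflow.cfg
    )
    on_any_worker >> on_master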
