The Celery executor is not able to execute a multi-branch pipeline. The same pipeline, when tried with the LocalExecutor, executes both branches.
Also, the Celery executor is creating one message per DAG in Redis. Is this expected behavior?
taskA1->taskA2->taskA3
                  ^
taskB1->taskB2----|
It is picking up only one flow, either the taskA1 branch or the taskB1 branch; the other branch remains in the queued state.
a >> b
b >> c
a1 >> b1
b1 >> c1
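For reference, here is a minimal sketch of a DAG with the two independent chains above (the DAG id, dates, and operator choice are assumptions, not from the original post):

from datetime import datetime

from airflow import DAG
from airflow.operators.dummy import DummyOperator

# Hypothetical minimal reproduction: two independent chains in one DAG.
with DAG(
    dag_id="multi_branch_example",  # hypothetical name
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,
) as dag:
    a = DummyOperator(task_id="a")
    b = DummyOperator(task_id="b")
    c = DummyOperator(task_id="c")
    a1 = DummyOperator(task_id="a1")
    b1 = DummyOperator(task_id="b1")
    c1 = DummyOperator(task_id="c1")

    a >> b >> c
    a1 >> b1 >> c1

Under the LocalExecutor both chains run in parallel; under the CeleryExecutor only one chain leaves the queued state.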
I am trying to set up an Airflow cluster for my project, and I am using the Celery executor. Along with this I am using RabbitMQ as the queueing service and PostgreSQL as the database. For now I have two master nodes and two worker nodes. All the services are up and running, and I was able to configure my master nodes with the Airflow webserver and scheduler. But on my worker nodes I am running into an issue where I get an error:
airflow command error: argument GROUP_OR_COMMAND: celery subcommand works only with CeleryExecutor, CeleryKubernetesExecutor and executors derived from them, your current executor: SequentialExecutor, subclassed from: BaseExecutor, see help above.
I did configure my airflow.cfg properly. I set the executor value to CeleryExecutor (doesn't this mean I have set the executor value?).
My airflow.cfg is as follows:
Note: I am just adding parts of the config that I think is relevant to the issue.
[celery]
# This section only applies if you are using the CeleryExecutor in
# ``[core]`` section above
# The app name that will be used by celery
celery_app_name = airflow.executors.celery_executor
# The concurrency that will be used when starting workers with the
# ``airflow celery worker`` command. This defines the number of task instances that
# a worker will take, so size up your workers based on the resources on
# your worker box and the nature of your tasks
worker_concurrency = 16
# The maximum and minimum concurrency that will be used when starting workers with the
# ``airflow celery worker`` command (always keep minimum processes, but grow
# to maximum if necessary). Note the value should be max_concurrency,min_concurrency
# Pick these numbers based on resources on worker box and the nature of the task.
# If autoscale option is available, worker_concurrency will be ignored.
# http://docs.celeryproject.org/en/latest/reference/celery.bin.worker.html#cmdoption-celery-worker-autoscale
# Example: worker_autoscale = 16,12
# worker_autoscale =
# Used to increase the number of tasks that a worker prefetches which can improve performance.
# The number of processes multiplied by worker_prefetch_multiplier is the number of tasks
# that are prefetched by a worker. A value greater than 1 can result in tasks being unnecessarily
# blocked if there are multiple workers and one worker prefetches tasks that sit behind long
# running tasks while another worker has unutilized processes that are unable to process the already
# claimed blocked tasks.
# https://docs.celeryproject.org/en/stable/userguide/optimizing.html#prefetch-limits
worker_prefetch_multiplier = 1
# Specify if remote control of the workers is enabled.
# When using Amazon SQS as the broker, Celery creates lots of ``.*reply-celery-pidbox`` queues. You can
# prevent this by setting this to false. However, with this disabled Flower won't work.
worker_enable_remote_control = true
# Umask that will be used when starting workers with the ``airflow celery worker``
# in daemon mode. This control the file-creation mode mask which determines the initial
# value of file permission bits for newly created files.
worker_umask = 0o077
# The Celery broker URL. Celery supports RabbitMQ, Redis and experimentally
# a sqlalchemy database. Refer to the Celery documentation for more information.
broker_url = amqp://admin:password@{hostname}:5672/
# The Celery result_backend. When a job finishes, it needs to update the
# metadata of the job. Therefore it will post a message on a message bus,
# or insert it into a database (depending of the backend)
# This status is used by the scheduler to update the state of the task
# The use of a database is highly recommended
# http://docs.celeryproject.org/en/latest/userguide/configuration.html#task-result-backend-settings
result_backend = db+postgresql://postgres:airflow@postgres/airflow
# The executor class that airflow should use. Choices include
# ``SequentialExecutor``, ``LocalExecutor``, ``CeleryExecutor``, ``DaskExecutor``,
# ``KubernetesExecutor``, ``CeleryKubernetesExecutor`` or the
# full import path to the class when using a custom executor.
executor = CeleryExecutor
Please let me know if I haven't added sufficient information pertinent to my problem. Thank you.
The reason for the above error could be:
Airflow is picking up the default value of the executor, which lives in the [core] section of airflow.cfg (i.e. SequentialExecutor), from the template for Airflow's default configuration. When Airflow is imported, it looks for a configuration file at $AIRFLOW_HOME/airflow.cfg; if it doesn't exist, Airflow uses this template.
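One way to verify this is to check, on the affected node, which AIRFLOW_HOME and executor Airflow actually resolves. A minimal sketch, assuming the airflow package is importable there:

import os

from airflow.configuration import conf

# Where Airflow looks for airflow.cfg; if unset, it falls back to ~/airflow
print("AIRFLOW_HOME =", os.environ.get("AIRFLOW_HOME"))

# The executor resolved from the config Airflow actually read;
# this should print CeleryExecutor, not SequentialExecutor
print("executor =", conf.get("core", "executor"))

If this prints SequentialExecutor, the node is not reading the airflow.cfg you edited.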
The following solution is applicable if you are using the official helm chart:
Change the default value of the executor in the core section of airflow.cfg.
(Snapshot of the default configuration omitted.)
Pass the environment variable AIRFLOW_HOME to the Flower deployment/container. You can simply pass environment variables to all the containers by adding the following to the values file of the helm chart:
env:
  - name: "AIRFLOW_HOME"
    value: "/path/to/airflow/home"
In case the airflow user doesn't have access to the path you passed in the environment variable AIRFLOW_HOME, run the Flower container as the root user, which can be done by adding the following config to the values file of the helm chart:
flower:
  enabled: true
  securityContext:
    runAsUser: 0
Is there a way to add a task that runs once all other tasks in the same DAG have run successfully? See below for my current DAG.
For example, my current tasks run in the order below, but I want new_task to run once all of the below have run. If I leave it as below, new_task won't run:
for endpoint in ENDPOINTS:
    latest_only = (operator...)
    s3 = (operator...)
    # etc ....

    latest_only >> s3 >> short_circuit
    short_circuit >> snowflake >> success
    short_circuit >> postgres >> success

    if endpoint.name == "io_lineitems":
        success >> il_io_lineitems_tables
        copy_monthly_billing >> load_io_monthly_billing_to_snowflake
        copy_monthly_billing >> load_io_monthly_billing_to_postgres

new_task
Use trigger rules to run a task based on the success status of its upstream tasks; see the trigger rules documentation.
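A minimal sketch of that idea, reusing ENDPOINTS and success from the question (the NONE_FAILED rule is my assumption, chosen so branches skipped by the ShortCircuitOperator don't also skip new_task):

from airflow.operators.dummy import DummyOperator
from airflow.utils.trigger_rule import TriggerRule

# Fan-in task: runs once every branch has either succeeded or been skipped
new_task = DummyOperator(
    dag=dag,
    task_id="new_task",
    trigger_rule=TriggerRule.NONE_FAILED,
)

for endpoint in ENDPOINTS:
    # ... existing wiring from the question ...
    success >> new_task

Wiring each branch's final task into new_task makes it a single fan-in point for the whole DAG.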
I am trying to add alerts to my Airflow DAGs. The DAGs have multiple tasks, some with up to 15.
I want to execute a bash script (a general script for all DAGs) in case any task fails at any point.
For example, a DAG has tasks T1 to T5, as T1 >> T2 >> T3 >> T4 >> T5.
I want to trigger task A (representing alerts) in case any of these fail.
It would be really helpful if anyone could help me with the task hierarchy.
You have two options, IMO: failure callbacks and trigger rules.
Success / Failure Callback
Airflow Task Instances have a concept of what to do in case of failure or success. These are callbacks that will be run when a Task reaches a specific state... here are your options:
...
on_failure_callback=None,
on_success_callback=None,
on_retry_callback=None
...
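As a minimal sketch of the callback route (the script path and function name are hypothetical), a failure callback that shells out to a general alert script could look like this:

import subprocess

def alert_on_failure(context):
    # context holds task/DAG metadata supplied by Airflow at runtime
    ti = context["task_instance"]
    subprocess.run(
        ["/path/to/alert.sh", ti.dag_id, ti.task_id],  # hypothetical script
        check=False,
    )

# Passing it via default_args makes every task in the DAG inherit it
default_args = {"on_failure_callback": alert_on_failure}

This avoids adding an extra task to the graph; the callback fires whenever any task fails.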
Trigger Rules
Airflow Task Instances have a concept of what state of their upstream tasks to trigger on, with the default being ALL_SUCCESS. That means your main branch can stay as it is, and you can branch off to A from T1 as:
from airflow.utils.trigger_rule import TriggerRule
T1 >> DummyOperator(
dag=dag,
task_id="task_a",
trigger_rule=TriggerRule.ALL_FAILED
)
Alternatively, you can build your branch and include A as:
from airflow.utils.trigger_rule import TriggerRule
[T1, T2, T3, ...] >> DummyOperator(
dag=dag,
task_id="task_a",
trigger_rule=TriggerRule.ONE_FAILED
)
Our Airflow installation is using the CeleryExecutor.
The concurrency configs were:
# The amount of parallelism as a setting to the executor. This defines
# the max number of task instances that should run simultaneously
# on this airflow installation
parallelism = 16
# The number of task instances allowed to run concurrently by the scheduler
dag_concurrency = 16
# Are DAGs paused by default at creation
dags_are_paused_at_creation = True
# When not using pools, tasks are run in the "default pool",
# whose size is guided by this config element
non_pooled_task_slot_count = 64
# The maximum number of active DAG runs per DAG
max_active_runs_per_dag = 16
[celery]
# This section only applies if you are using the CeleryExecutor in
# [core] section above
# The app name that will be used by celery
celery_app_name = airflow.executors.celery_executor
# The concurrency that will be used when starting workers with the
# "airflow worker" command. This defines the number of task instances that
# a worker will take, so size up your workers based on the resources on
# your worker box and the nature of your tasks
celeryd_concurrency = 16
We have a DAG that executes daily. It has a number of tasks running in parallel, each following a pattern that senses whether the data exists in HDFS, then sleeps 10 minutes, and finally uploads to S3.
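Each parallel branch presumably looks something like this sketch (the sensor choice, paths, and task ids are my assumptions; the imports are in the Airflow 1.10 style matching the 2019 logs below):

from airflow.operators.bash_operator import BashOperator
from airflow.sensors.hdfs_sensor import HdfsSensor

# Hypothetical branch: wait for data in HDFS, sleep 10 minutes, upload to S3
sense = HdfsSensor(task_id="sense_data", filepath="/data/part", dag=dag)
sleep = BashOperator(task_id="sleep_10m", bash_command="sleep 600", dag=dag)
upload = BashOperator(
    task_id="upload_to_s3",
    bash_command="aws s3 cp /tmp/data s3://bucket/key",  # hypothetical command
    dag=dag,
)
sense >> sleep >> upload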
Some of the tasks have been encountering the following error:
2019-05-12 00:00:46,212 ERROR - Executor reports task instance <TaskInstance: example_dag.task1 2019-05-11 04:00:00+00:00 [queued]> finished (failed) although the task says its queued. Was the task killed externally?
2019-05-12 00:00:46,558 INFO - Marking task as UP_FOR_RETRY
2019-05-12 00:00:46,561 WARNING - section/key [smtp/smtp_user] not found in config
This kind of error occurs randomly in those tasks. When it happens, the state of the task instance is immediately set to up_for_retry, and there are no logs on the worker nodes. After some retries, they execute and eventually finish.
This problem sometimes causes large ETL delays. Does anyone know how to solve it?
We were facing similar problems, which were resolved by the "-x, --donot_pickle" option.
For more information: https://airflow.apache.org/cli.html#backfill
I was seeing very similar symptoms in my DagRuns. I thought it was due to the ExternalTaskSensor and concurrency issues, given the queuing and killed-task language that looked like this: Executor reports task instance <TaskInstance: dag1.data_table_temp_redshift_load 2019-05-20 08:00:00+00:00 [queued]> finished (failed) although the task says its queued. Was the task killed externally? But when I looked at the worker logs, I saw there was an error caused by setting a variable with Variable.set in my DAG. The issue is described here: duplicate key value violates unique constraint when adding path variable in airflow dag. The scheduler polls the DagBag at regular intervals to refresh any changes dynamically, and the error on every heartbeat was causing significant ETL delays.
Are you performing any logic in your wh_hdfs_to_s3 DAG (or others) that might be causing errors or delays / these symptoms?
We fixed this already; let me answer my own question:
We have 5 Airflow worker nodes. After installing Flower to monitor the tasks distributed to these nodes, we found that the failed task was always sent to one specific node. We tried using the airflow test command to run the task on the other nodes, and it worked there. Eventually, the cause turned out to be a wrong Python package on that specific node.
I have 2 DAGs: DAG A and DAG B.
I used the TriggerDagRunOperator in DAG A and passed the DAG ID, task ID, and parameters to it.
The task that triggers the second DAG executed successfully, and the status of DAG B is running, but the task in DAG B didn't get triggered. The schedule interval for DAG B is None.
Can someone help me resolve this issue?
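For reference, the trigger task described above presumably looks something like this sketch (the IDs and conf payload are assumptions; Airflow 2-style import):

from airflow.operators.trigger_dagrun import TriggerDagRunOperator

# Hypothetical trigger task in DAG A that kicks off dag_b with parameters
trigger_b = TriggerDagRunOperator(
    task_id="trigger_dag_b",
    trigger_dag_id="dag_b",
    conf={"param": "value"},  # hypothetical payload
    dag=dag_a,
)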
Did you
start the scheduler (via airflow scheduler)?
enable the DAG to be triggered (by default, DAGs are paused)?
... both are necessary conditions for tasks to get run...