I have a task that can run for ~10+ hours, say in a for loop.
I want to store checkpoints after each loop execution, so that if there is some error in the task or if the worker crashes, the retried task can resume from where it left off by retrieving the checkpoint information specific to that task run.
So, the question is: how and where can I store this checkpoint information?
The task logic is below (in pseudocode):
long_running_task:
    seqNo = getStoredCheckpointForTask()
    do
        if (seqNo == null)
            seqNo = getFirstSequenceFromSomeSource()         // 1-2 seconds
        doSomething(seqNo)                                    // 3-4 seconds
        seqNo = getNextSequenceFromSomeSource(oldSeq: seqNo)  // 1-2 seconds
        storeCheckpointForTask(seqNo)
    while seqNo != null
If your task is running on Airflow workers, you have two options:
You can use an external storage system (for example S3) as the checkpoint store; a sketch of this option follows the XCom example below.
Or you can use the Airflow metadata database (Airflow DB) as the checkpoint store by saving an XCom:
def my_task_func(**context):
    ti = context["ti"]
    checkpoint_key = f"checkpoint_{context['execution_date']}"
    # Resume from the last stored checkpoint, if there is one
    seqNo = ti.xcom_pull(key=checkpoint_key, default=None)
    while True:
        if not seqNo:
            seqNo = getFirstSequenceFromSomeSource()
        doSomething(seqNo)
        seqNo = getNextSequenceFromSomeSource(seqNo)
        if not seqNo:
            break
        # Store the checkpoint after each iteration
        ti.xcom_push(key=checkpoint_key, value=seqNo)
And if your task is running outside Airflow, the first option is still valid, but the second is not, and you will have other options depending on how you run your tasks (volumes for Docker, PVCs for Kubernetes, ...); a small file-based sketch of that approach follows.
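As a rough illustration of the volume/PVC approach, the checkpoint can simply be a small file on a path that survives container restarts; the path below is a hypothetical mount point:

import os

CHECKPOINT_PATH = "/mnt/checkpoints/my_task.seq"  # hypothetical mounted volume / PVC path

def getStoredCheckpointForTask():
    if os.path.exists(CHECKPOINT_PATH):
        with open(CHECKPOINT_PATH) as f:
            return f.read().strip() or None
    return None

def storeCheckpointForTask(seqNo):
    # Write to a temp file and rename, so a crash mid-write cannot corrupt the checkpoint
    tmp_path = CHECKPOINT_PATH + ".tmp"
    with open(tmp_path, "w") as f:
        f.write(str(seqNo))
    os.replace(tmp_path, CHECKPOINT_PATH)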
Related
I am trying to set up an Airflow cluster for my project, and I am using the Celery executor. Along with this I am using RabbitMQ as the queueing service and PostgreSQL as the database. For now I have two master nodes and two worker nodes. All the services are up and running, and I was able to configure my master nodes with the Airflow webserver and scheduler. But on my worker nodes I am running into an issue where I get an error:
airflow command error: argument GROUP_OR_COMMAND: celery subcommand works only with CeleryExecutor, CeleryKubernetesExecutor and executors derived from them, your current executor: SequentialExecutor, subclassed from: BaseExecutor, see help above.
I did configure my airflow.cfg properly. I set the executor value to CeleryExecutor (doesn't this mean I have set the executor value?).
My airflow.cfg is as follows:
Note: I am just adding parts of the config that I think is relevant to the issue.
[celery]
# This section only applies if you are using the CeleryExecutor in
# ``[core]`` section above
# The app name that will be used by celery
celery_app_name = airflow.executors.celery_executor
# The concurrency that will be used when starting workers with the
# ``airflow celery worker`` command. This defines the number of task instances that
# a worker will take, so size up your workers based on the resources on
# your worker box and the nature of your tasks
worker_concurrency = 16
# The maximum and minimum concurrency that will be used when starting workers with the
# ``airflow celery worker`` command (always keep minimum processes, but grow
# to maximum if necessary). Note the value should be max_concurrency,min_concurrency
# Pick these numbers based on resources on worker box and the nature of the task.
# If autoscale option is available, worker_concurrency will be ignored.
# http://docs.celeryproject.org/en/latest/reference/celery.bin.worker.html#cmdoption-celery-worker-autoscale
# Example: worker_autoscale = 16,12
# worker_autoscale =
# Used to increase the number of tasks that a worker prefetches which can improve performance.
# The number of processes multiplied by worker_prefetch_multiplier is the number of tasks
# that are prefetched by a worker. A value greater than 1 can result in tasks being unnecessarily
# blocked if there are multiple workers and one worker prefetches tasks that sit behind long
# running tasks while another worker has unutilized processes that are unable to process the already
# claimed blocked tasks.
# https://docs.celeryproject.org/en/stable/userguide/optimizing.html#prefetch-limits
worker_prefetch_multiplier = 1
# Specify if remote control of the workers is enabled.
# When using Amazon SQS as the broker, Celery creates lots of ``.*reply-celery-pidbox`` queues. You can
# prevent this by setting this to false. However, with this disabled Flower won't work.
worker_enable_remote_control = true
# Umask that will be used when starting workers with the ``airflow celery worker``
# in daemon mode. This control the file-creation mode mask which determines the initial
# value of file permission bits for newly created files.
worker_umask = 0o077
# The Celery broker URL. Celery supports RabbitMQ, Redis and experimentally
# a sqlalchemy database. Refer to the Celery documentation for more information.
broker_url = amqp://admin:password#{hostname}:5672/
# The Celery result_backend. When a job finishes, it needs to update the
# metadata of the job. Therefore it will post a message on a message bus,
# or insert it into a database (depending of the backend)
# This status is used by the scheduler to update the state of the task
# The use of a database is highly recommended
# http://docs.celeryproject.org/en/latest/userguide/configuration.html#task-result-backend-settings
result_backend = db+postgresql://postgres:airflow#postgres/airflow
# The executor class that airflow should use. Choices include
# ``SequentialExecutor``, ``LocalExecutor``, ``CeleryExecutor``, ``DaskExecutor``,
# ``KubernetesExecutor``, ``CeleryKubernetesExecutor`` or the
# full import path to the class when using a custom executor.
executor = CeleryExecutor
Please let me know if I haven't added sufficient information pertinent to my problem. Thank you.
The reason for the above error could be:
Airflow is picking up the default value of the executor, which is SequentialExecutor, from the core section of the default configuration template. When Airflow is imported, it looks for a configuration file at $AIRFLOW_HOME/airflow.cfg; if it doesn't exist, Airflow falls back to this default template.
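One quick way to confirm what a given node is actually resolving (assuming the Airflow 2.x CLI is available) is to run the following on the worker; if the second command prints SequentialExecutor, that node is not reading the airflow.cfg you edited:
$ airflow info
$ airflow config get-value core executor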
The following solution is applicable if you are using the official helm chart:
Change the default value of the executor in the core section of airflow.cfg.
Snapshot of default configuration
Pass the environment variable called AIRFLOW_HOME to the flower deployment/container. You can set environment variables in all the containers by adding the following to the helm chart's values file:
env:
- name: "AIRFLOW_HOME"
value: "/path/to/airflow/home"
In case the airflow user doesn't have access to the path you passed in the environment variable AIRFLOW_HOME, run the flower container as the root user, which can be done by passing the following config in the helm chart's values file:
flower:
enabled: true
securityContext:
runAsUser: 0
We are using Airflow as the orchestrator, and it schedules a workflow every hour. DataprocSubmitJobOperator is configured to schedule Dataproc jobs (it uses Spark). Spark syncs data from source to target (it runs for 50 minutes and then completes to avoid overlapping the next schedule).
Intermittently an Airflow task fails due to a zombie exception. The logs show an assertion failure on pthread_mutex_lock(mu), the Airflow task exits, and the underlying Dataproc job keeps running without issue.
Please suggest what the potential issue/fix could be.
[2021-12-22 23:01:17,150] {dataproc.py:1890} INFO - Submitting job
[2021-12-22 23:01:17,804] {dataproc.py:1902} INFO - Job 27a2c88d-1308-4407-b965-aa490e2217fb submitted successfully.
[2021-12-22 23:01:17,805] {dataproc.py:1905} INFO - Waiting for job 27a2c88d-1308-4407-b965-aa490e2217fb to complete
E1222 23:45:58.299007027 1267 sync_posix.cc:67] assertion failed: pthread_mutex_lock(mu) == 0
[2021-12-22 23:46:00,943] {local_task_job.py:102} INFO - Task exited with return code Negsignal.SIGABRT
Config
raw_data_sync = DataprocSubmitJobOperator(
    task_id="raw_data_sync",
    job=RAW_DATA_GENERATION,
    location='us-central1',
    project_id='1f780b38bd7b0384e53292de20',
    execution_timeout=timedelta(seconds=3420),
    dag=dag
)
I've got two tasks.
A BashOperator [kinit], which obtains a Kerberos ticket for Hadoop.
A Hive sensor [check_partition], which checks whether a partition exists.
My problem is that the Kerberos ticket is valid for 9 hours, while the Hive sensor might wait anywhere from 1 to 15 hours, because the time when the data arrives is really fickle. Therefore I would like to execute kinit each time the Hive sensor is rescheduled (every hour).
kinit = BashOperator(
    task_id="CIDF_BASH_KINIT",
    bash_command="bash kinit command",
    dag=dag
)

check_partition = HiveCLIPartitionSensor(
    task_id="CIDF_BASH_HIVE_CHECK_PARTITION",
    table='table',
    partition="partition='{}'".format('{{ ds }}'),
    poke_interval=60*60,
    mode='reschedule',
    retries=0,
    timeout=60*60*23,
    dag=dag
)
kinit >> check_partition
You can run a cron job, or something else scheduled in the background, that generates a Kerberos ticket every 5-6 hours automatically.
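For example, a crontab entry along these lines on the worker host would renew the ticket cache every 6 hours (the keytab path and principal are placeholders):
0 */6 * * * kinit -kt /etc/security/keytabs/airflow.keytab airflow@EXAMPLE.COM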
I have a DAG 'abc' scheduled to run every day at 7 AM CST and there is task 'xyz' in that DAG.
For some reason, I do not want to run one of the tasks 'xyz' for tomorrow's instance.
How can I skip that particular task instance?
I do not want to make any changes to the code, as I do not have access to the Prod code and the task is already in the Prod environment.
Is there any way to do that using the command line?
Appreciate any help on this.
You can mark the unwanted tasks as succeeded using the airflow run command. Tasks marked as succeeded will not be run anymore.
Assume there is a DAG with ID a_dag and three tasks with IDs dummy1, dummy2 and dummy3. We want to skip the dummy3 task in the next DAG run.
First, we get the next execution date:
$ airflow next_execution a_dag
2020-06-12T21:00:00+00:00
Then we mark dummy3 as succeeded for this execution date:
$ airflow run -fAIim a_dag dummy3 '2020-06-12T21:00:00+00:00'
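(The combined flags roughly mean: -f force, -A ignore all dependencies, -I ignore depends_on_past, -i ignore task dependencies, and -m mark success, i.e. record the task as succeeded without actually running it.)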
To be sure, we can check the task state. For the skipped task it will be success:
$ airflow task_state a_dag dummy3 '2020-06-12T21:00:00+00:00'
...
success
For the rest of the tasks the state will be None:
$ airflow task_state a_dag dummy1 '2020-06-12T21:00:00+00:00'
...
None
Our airflow installation is using CeleryExecutor.
The concurrency configs were
# The amount of parallelism as a setting to the executor. This defines
# the max number of task instances that should run simultaneously
# on this airflow installation
parallelism = 16
# The number of task instances allowed to run concurrently by the scheduler
dag_concurrency = 16
# Are DAGs paused by default at creation
dags_are_paused_at_creation = True
# When not using pools, tasks are run in the "default pool",
# whose size is guided by this config element
non_pooled_task_slot_count = 64
# The maximum number of active DAG runs per DAG
max_active_runs_per_dag = 16
[celery]
# This section only applies if you are using the CeleryExecutor in
# [core] section above
# The app name that will be used by celery
celery_app_name = airflow.executors.celery_executor
# The concurrency that will be used when starting workers with the
# "airflow worker" command. This defines the number of task instances that
# a worker will take, so size up your workers based on the resources on
# your worker box and the nature of your tasks
celeryd_concurrency = 16
We have a DAG that executes daily. It has a number of tasks running in parallel, following a pattern that senses whether the data exists in HDFS, then sleeps for 10 minutes, and finally uploads to S3.
Some of the tasks have been encountering the following error:
2019-05-12 00:00:46,212 ERROR - Executor reports task instance <TaskInstance: example_dag.task1 2019-05-11 04:00:00+00:00 [queued]> finished (failed) although the task says its queued. Was the task killed externally?
2019-05-12 00:00:46,558 INFO - Marking task as UP_FOR_RETRY
2019-05-12 00:00:46,561 WARNING - section/key [smtp/smtp_user] not found in config
This kind of error occurs randomly in those tasks. When it happens, the state of the task instance is immediately set to up_for_retry, and there are no logs on the worker nodes. After some retries, the tasks eventually execute and finish.
This problem sometimes causes large ETL delays. Does anyone know how to solve it?
We were facing similar problems, which were resolved by the
"-x, --donot_pickle" option.
For more information: https://airflow.apache.org/cli.html#backfill
I was seeing very similar symptoms in my DagRuns. I thought it was due to the ExternalTaskSensor and concurrency issues, given the queuing and killed-task language that looked like this:
Executor reports task instance <TaskInstance: dag1.data_table_temp_redshift_load 2019-05-20 08:00:00+00:00 [queued]> finished (failed) although the task says its queued. Was the task killed externally?
But when I looked at the worker logs, I saw an error caused by setting a variable with Variable.set in my DAG. The issue is described in "duplicate key value violates unique constraint when adding path variable in airflow dag": the scheduler polls the dagbag at regular intervals to refresh any changes dynamically, and the error on every heartbeat was causing significant ETL delays.
Are you performing any logic in your wh_hdfs_to_s3 DAG (or others) that might be causing errors or delays / these symptoms?
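To illustrate the kind of top-level code that runs on every scheduler parse (the variable name and values here are just placeholders), the usual fix is to only read variables at parse time and do any writes inside a task:

from airflow.models import Variable

# Problematic: module-level writes run every time the scheduler parses the DAG file
# Variable.set("data_path", "/data/latest")

# Safer: read with a default at parse time ...
data_path = Variable.get("data_path", default_var="/tmp/data")

def update_path_variable(**context):
    # ... and only write when the task actually executes
    Variable.set("data_path", "/data/{}".format(context["ds"]))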
We fixed this already. Let me answer my own question:
We have 5 Airflow worker nodes. After installing Flower to monitor the tasks distributed to these nodes, we found out that the failed tasks were always sent to one specific node. We tried to use the airflow test command to run the task on other nodes and it worked. Eventually, the cause turned out to be a wrong Python package on that specific node.