Airflow tasks are queued after setting retry_exponential_backoff to true - airflow

Issue Summary
A few of my Airflow tasks are still queued even after the next run has executed fine.
My default_args:
default_args = {
    'owner': 'my-owner',
    'depends_on_past': False,
    'email': ['email@org.com'],
    'email_on_failure': True,
    'email_on_retry': True,
    'retries': 2,
    'retry_exponential_backoff': True,
    'retry_delay': datetime.timedelta(minutes=5),
    'start_date': datetime.datetime(2020, 4, 1)
}
Before setting retry_exponential_backoff, the DAG failed with Exceeded rate limits: too many table update operations for this table.
After setting retry_exponential_backoff to True there are no more rate-limit failures, but for one run I can see a few tasks that are still not triggered even after the next run has executed.
What could be the issue? I appreciate your support.

Usually, you can click on the queued (grey) task and select the first option, "Instance Details". In the upper part you can then read the Task Instance Details section, which has details about the state of the task.
As for your use case, we still need more details: Airflow version, your schedule_interval, more logs, which executor you are using, num_runs, concurrency, etc.
Please check the official Airflow documentation here.
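If you would rather inspect this outside the UI, a rough equivalent (not from the original answer; the dag_id is a placeholder) is to query the metadata database through Airflow's ORM:

# Illustrative sketch: list the task instances of a DAG that are sitting in 'queued'.
from airflow import settings
from airflow.models import TaskInstance

session = settings.Session()
queued = (
    session.query(TaskInstance)
    .filter(TaskInstance.dag_id == 'my_dag_id', TaskInstance.state == 'queued')
    .all()
)
for ti in queued:
    print(ti.task_id, ti.execution_date, ti.state)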

Related

How does the mode "reschedule" in Airflow Sensors work?

I have an Airflow Http sensor that calls a REST endpoint and checks for a specific value in the JSON structure returned by the API:
sensor = HttpSensor(
    soft_fail=True,
    task_id='http_sensor_check',
    http_conn_id='http_default',
    endpoint='http://localhost:8082/api/v1/resources/games/all',
    request_params={},
    response_check=lambda response: True if check_api_response(response) is True else False,
    mode='reschedule',
    dag=dag)
If the response_check is false, the sensor task is put in an "up_for_reschedule" state. The issue is that it stayed in that status forever and never got rescheduled.
My questions are:
What does "up_for_reschedule" mean, and when would the task be rescheduled?
Let's suppose my DAG is scheduled to run every 5 minutes, but because of the sensor the "up_for_reschedule" DAG instance overlaps with the new run; will I have 2 DAGs running at the same time?
Thank you in advance.
In sensors, mode='reschedule' means that if the sensor's criteria aren't met, the sensor releases its worker slot to other tasks. This is very useful for cases where the sensor may have to wait for a long time.
up_for_reschedule means that the sensor condition isn't true yet and the sensor hasn't reached its timeout, so the task is waiting to be rescheduled by the scheduler.
You don't know when the task will run; that depends on the scheduler (available resources, priorities, etc.). If you don't want to allow parallel DAG runs, use max_active_runs=1 in the DAG constructor.
Side note:
response_check=lambda response: True if check_api_response(response) is True else False,
is the same as:
response_check=lambda response: check_api_response(response),
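Putting that together, here is a rough sketch (not from the original question: the DAG id, schedule, poke_interval and timeout are made up, and the import paths are the Airflow 1.10.x ones) of a reschedule-mode sensor combined with max_active_runs=1 so runs don't overlap:

from datetime import datetime, timedelta

from airflow import DAG
from airflow.sensors.http_sensor import HttpSensor

def check_api_response(response):
    # placeholder for the real check on the returned JSON
    return response.json().get('status') == 'ok'

dag = DAG(
    'api_check',
    start_date=datetime(2021, 1, 1),
    schedule_interval=timedelta(minutes=5),
    max_active_runs=1,  # no parallel DAG runs while the sensor is waiting
)

sensor = HttpSensor(
    task_id='http_sensor_check',
    http_conn_id='http_default',
    endpoint='api/v1/resources/games/all',  # relative to the base URL on http_default
    response_check=check_api_response,      # a plain callable works here
    mode='reschedule',                      # frees the worker slot between pokes
    poke_interval=60,                       # check once a minute
    timeout=60 * 30,                        # give up after 30 minutes
    dag=dag)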

Send failure email to recipients based on variable using Jinja

I'm trying to send an email to certain people defined in a variable in the Airflow UI. I see a similar question here but mine is slightly different. I suspect I can't do this when I'm using Jinja templating?
Apache airflow sends sla miss emails only to first person on the list
My variable:
{ "config1" : "abc", "config2":500000, "email_on_failure_list": ["jorge#gmail.com","maria#gmail.com"]}
My DAG: it's bad practice to use Variable.get() in top-level code when referencing a variable created in the Airflow UI, so I'm using Jinja templating, but it doesn't seem to be working.
email = "{{ var.json.my_dag.email_on_failure_list}}"
default_args = {
'depends_on_past': False,
'start_date': datetime(2020, 9, 1),
'email_on_failure': True,
'email': email
}
This is the error I receive:
ERROR - Failed to send email to: {{ var.json.my_dag.email_on_failure_list}}
[2021-02-09 09:13:16,339] {{taskinstance.py:1201}} ERROR - {'{{ var.json.my_dag.email_on_failure_list}}': (501, b'5.1.3 Invalid address')}
Short answer: it's not possible with Jinja templates.
The alternative that was suggested on the Airflow Slack channel (thanks Kaxil!) is to create an environment variable. I can then reference it using Variable.get without any of the performance issues.
Environment variables aren't ideal for something non-sensitive, since they require more access to modify, but they're a decent alternative for keeping different configuration settings across Airflow environments (dev, qa, prod).
Follow the instructions here:
https://airflow.apache.org/docs/apache-airflow/2.0.1/howto/variable.html#storing-variables-in-environment-variables
Steps:
Create an environment variable on the Airflow server (e.g. AIRFLOW_VAR_MY_EMAIL_LIST). It's important not to create the variable in the Airflow UI.
Use it in your DAG:
from airflow.models import Variable

# Airflow strips the AIRFLOW_VAR_ prefix, so the variable is looked up by its short name
my_email_list = Variable.get("my_email_list")

default_args = {
    'depends_on_past': False,
    'start_date': datetime(2020, 9, 1),
    'email_on_failure': True,
    'email': my_email_list
}
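Since the variable holds a JSON list of addresses, it may be cleaner to deserialize it explicitly; a small sketch, assuming the server exports AIRFLOW_VAR_MY_EMAIL_LIST='["jorge@gmail.com", "maria@gmail.com"]':

from airflow.models import Variable

# deserialize_json turns the JSON string from the environment variable into a Python list
my_email_list = Variable.get("my_email_list", deserialize_json=True)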

Airflow task stuck in status 'scheduled'

I ran a backfill command and all preceding tasks are done except for the last 4 tasks. Their status is scheduled and nothing happens at all. I have run airflow test before and it succeeded, but now I don't know what to do.
I specified a non-existent pool, but I think this shouldn't matter because the preceding tasks ran smoothly.
Uhm, no;
AFAIK, a non-existent pool results in tasks getting stuck in a limbo state.
But I've found that adding/removing pools or modifying the number of slots in a pool takes effect even while a DAG is running, so try creating the pool with an adequate number of slots.
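For example, a rough sketch of creating the pool programmatically through Airflow's ORM on an older Airflow version (the pool name, slot count and description are illustrative; the Admin -> Pools page in the UI does the same thing):

# Illustrative only: create the missing pool if it doesn't exist yet.
from airflow import settings
from airflow.models import Pool

session = settings.Session()
if not session.query(Pool).filter(Pool.pool == 'pool_name').first():
    session.add(Pool(pool='pool_name', slots=32, description='pool for backfill tasks'))
    session.commit()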
I'm glad to join the conversation. I have now created the pool with an adequate number of slots, but the problem is still there.
Besides that, I set the priority_weight of the DAG, and it works! The tasks begin to run. The change looks like this:
default_args = {
    'owner': 'Airflow',
    'retries': 2,
    'queue': 'queue_name',
    'pool': 'pool_name',
    'priority_weight': 10,  # the larger the number, the higher the weight
    'retry_delay': timedelta(seconds=10)}

Apache Airflow tasks are stuck in a 'up_for_retry' state

I've been setting up an Airflow cluster on our system, and it has previously been working. I'm not sure what I may have done to change this.
I have a DAG which I want to run on a schedule. To make sure it's working I'd also like to trigger it manually. Neither of these seems to be working at the moment, and no logs are being written for the task instances. The only logs available are the Airflow scheduler logs, which generally look healthy.
I am just constantly met with this message:
Task is not ready for retry yet but will be retried automatically. Current date is 2018-12-12T11:34:46.978355+00:00 and task will be retried at 2018-12-12T11:35:08.093313+00:00.
However, if I wait a little the exact same message is presented again except the times have moved forward a little. Therefore, it seems the task is never actually being retried.
I am using a LocalExecutor and the task is an SSHOperator. Simplified code below. All it does is ssh onto another machine and start a bunch of applications with a pre-determined directory structure:
# Imports were omitted from the original simplified snippet; these are the
# Airflow 1.x paths (ConfigDbClient is the asker's own helper, imported elsewhere).
import json
import re
from datetime import datetime, timedelta

from airflow import DAG
from airflow.contrib.hooks.ssh_hook import SSHHook
from airflow.contrib.operators.ssh_operator import SSHOperator
from airflow.utils import timezone

DB_INFO_FILE = 'info.json'
START_SCRIPT = '/bin/start.sh'
TIME_IN_PAST = timezone.convert_to_utc(datetime.today() - timedelta(days=1))

DEFAULT_ARGS = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': TIME_IN_PAST,
    'email': ['some_email@blah.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=1),
}

def _extract_instance_id(instance_string):
    return re.findall(r'\d+', instance_string)[0]

def _read_file_as_json(file_name):
    with open(file_name) as open_file:
        return json.load(open_file)

DB_INFO = _read_file_as_json(DB_INFO_FILE)
CONFIG_CLIENT = ConfigDbClient(**DB_INFO)
APP_DIRS = CONFIG_CLIENT.get_values('%my-app-info%')

INSTANCE_START_SCRIPT_PATHS = {
    _extract_instance_id(instance_string): directory + START_SCRIPT
    for instance_string, directory in APP_DIRS.items()
}

# Create an ssh hook which refers to pre-existing connection information
# set up and stored by airflow
SSH_HOOK = SSHHook(ssh_conn_id='my-conn-id')

# Create a DAG object to add tasks to
DAG = DAG('my-dag-id',
          default_args=DEFAULT_ARGS)

# Create a task for each app instance.
for instance_id, start_script in INSTANCE_START_SCRIPT_PATHS.items():
    task = SSHOperator(
        task_id='run-script-{0}'.format(instance_id),
        command='bash {0}'.format(start_script),
        ssh_hook=SSH_HOOK,
        dag=DAG)
It works when I run the tasks individually via the command line, but not via the UI. It seems I can run tasks, but I simply cannot trigger a DAG to run. I've also tried many combinations of start_date and schedule_interval just to sanity check.
Any ideas?
And yes, I am aware this question has been asked before and I have looked at all of them, but none of the solutions have helped me.
Oh. Your start_date is changing at the same rate or faster than the schedule interval period.
Here's what the scheduler is seeing every couple of seconds:
start_date: 2018-12-11T12:12:12.000Z # e.g. if now is 2018-12-12T12:12:12.000Z, a day ago
schedule_interval: timedelta(days=1) # The default
Here's what the scheduler needs for a DAG to run: the last run occurred more than one schedule interval ago. If no scheduled run has occurred yet, the first scheduled run can start now only if one full schedule interval has passed since the start_date, as that is the earliest allowable execution_date. In that case a dag_run with execution_date set to the start of that interval period should be created. Then task_instances can be created for any tasks in the DAG whose dependencies are met, as long as the task_instance execution_date is after the DAG's start_date (this is not stored on the dag_run object but recomputed by loading the DAG file just while inspecting the DAG's state).
So it's not getting scheduled automatically because the start_date keeps moving forward just as the interval is satisfied. However, if the start_date were 2 days in the past, at least one run would get scheduled, and any further runs would then have to wait until 1 day after that to be scheduled. It's easier, though, if you just set a fixed datetime as your start_date.
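For example, a minimal sketch of that fix (the pinned date is illustrative):

# Pin start_date to a fixed datetime in the past instead of recomputing
# "now minus one day" on every DAG-file parse.
from datetime import datetime, timedelta

DEFAULT_ARGS = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2018, 12, 10),  # fixed date, never a moving target
    'retries': 1,
    'retry_delay': timedelta(minutes=1),
}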
But what about those odd retries on your manual runs…
You did start a manual run or two. These runs took the current time as the execution_date unless you specified something else. This should be after the start_date, at least until tomorrow, which should clear them to run. But then it seems from your logs that they're failing, being marked for retry, and also not decrementing your retries. I'm not sure why that would be, but could it be that something isn't right with the SSHOperator?
Did you install airflow with the [ssh] extra so that SSHOperator's dependencies are met (specifically paramiko and sshtunnel) on both the webserver and scheduler? One of them is working because I assume it's parsing and showing up in the UI based on being added to the DB.
What do you get if you execute:
airflow test my-dag-id run-script-an_instance_id 2018-12-12T12:12:12
You know that the scheduler and webserver are looping over refilling their DAG bag, and so rerunning this DAG file a few thousand times a day, reloading that JSON (it's local access, so similar to importing a module) and recreating that SSHHook with a DB lookup. I don't see anything fancy done setting up this hook, so why not remove ssh_hook from the SSHOperator and replace it with ssh_conn_id='my-conn-id', so the hook can be created once at execution time?
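For example, the loop from the question could become the following sketch, unchanged apart from passing the connection id instead of the pre-built hook:

# Passing ssh_conn_id lets the operator build the hook at execution time
# instead of on every DAG-file parse.
for instance_id, start_script in INSTANCE_START_SCRIPT_PATHS.items():
    task = SSHOperator(
        task_id='run-script-{0}'.format(instance_id),
        command='bash {0}'.format(start_script),
        ssh_conn_id='my-conn-id',
        dag=DAG)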
I doubt that's the issue that's causing the retries that just roll forward though.
I had a task stuck in up_for_retry for almost 24 hours before I noticed it, and it had nothing to do with the start_date, end_date, or any other classic beginner's problem.
I resorted to reading the source code, and found that Airflow treats up_for_retry tasks differently if they are part of a backfill DAG run.
So I connected to the metadata DB and changed backfill_ to scheduled__ in the dag_run row corresponding to the stuck task. Airflow started running the stuck task immediately.
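For what it's worth, a rough sketch of that manual fix using Airflow's ORM rather than raw SQL (the dag_id is a placeholder; back up the metadata DB first, and note that newer versions also track the run type in a separate run_type column):

# Rough sketch only: flip the stuck backfill run's run_id prefix so the scheduler
# treats it like a scheduled run. Assumes there is exactly one stuck backfill run.
from airflow import settings
from airflow.models import DagRun

session = settings.Session()
stuck_run = (
    session.query(DagRun)
    .filter(DagRun.dag_id == 'my-dag-id', DagRun.run_id.like('backfill_%'))
    .one()
)
stuck_run.run_id = stuck_run.run_id.replace('backfill_', 'scheduled__', 1)
session.commit()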

How to get airflow to add thousands of tasks to celery at one time?

I'm evaluating Airflow 1.9.0 for our distributed orchestration needs (using CeleryExecutor and RabbitMQ), and I am seeing something strange.
I made a dag that has three stages:
start
fan out and run N tasks concurrently
finish
N can be large, maybe up to 10K. I would expect to see N tasks get dumped onto the Rabbit queue when stage 2 begins. Instead I am seeing only a few hundred added at a time. As the workers process the tasks and the queue gets smaller, more get added to Celery/Rabbit. Eventually it does finish; however, I would really prefer that it dump ALL the work (all 10K tasks) into Celery immediately, for two reasons:
The current way makes the scheduler long-lived and stateful. The scheduler might die after only 5K have completed, in which case the remaining 5K tasks would never get added (I verified this).
I want to use the size of the Rabbit queue as a metric to trigger autoscaling events to add more workers. So I need a true picture of how much outstanding work remains (10K, not a few hundred).
I assume the scheduler has some kind of throttle that keeps it from dumping all 10K messages simultaneously? If so is this configurable?
FYI I have already set "parallelism" to 10K in the airflow.cfg.
Here is my test dag:
# This dag tests how well airflow fans out
from airflow import DAG
from datetime import datetime, timedelta
from airflow.operators.bash_operator import BashOperator

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2015, 6, 1),
    'email': ['airflow@example.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}

dag = DAG('fan_out', default_args=default_args, schedule_interval=None)

num_tasks = 10000

starting = BashOperator(
    task_id='starting',
    bash_command='echo starting',
    dag=dag
)

all_done = BashOperator(
    task_id='all_done',
    bash_command='echo all done',
    dag=dag)

for i in range(0, num_tasks):
    task = BashOperator(
        task_id='say_hello_' + str(i),
        bash_command='echo hello world',
        dag=dag)
    task.set_upstream(starting)
    task.set_downstream(all_done)
There are a couple other settings you'll want to increase.
Under [core] increase non_pooled_task_slot_count. This will allow more tasks to actually be queued up in celery.
Under [celery] increase celeryd_concurrency. This will increase the number of tasks each worker will attempt to run from the queue at the same time.
That being said, in response to your first reason...
It's true that the remaining tasks won't get queued if the scheduler isn't running, but that's because the Airflow scheduler is designed to be long-lived. It should always be running when your workers are running. Should the scheduler be killed or die for whatever reason, once it starts back up it will pick up where it left off.
Thanks to those who suggested other concurrency settings. Through trial and error I learned that I need to set all three of these:
- AIRFLOW__CORE__PARALLELISM=10000
- AIRFLOW__CORE__NON_POOLED_TASK_SLOT_COUNT=10000
- AIRFLOW__CORE__DAG_CONCURRENCY=10000
With only these two enabled, I can get to 10K but it is very slow, only adding 100 new tasks in bursts every 30 seconds, in a stair-step fashion:
- AIRFLOW__CORE__PARALLELISM=10000
- AIRFLOW__CORE__NON_POOLED_TASK_SLOT_COUNT=10000
If I only enable these two, it is the same "stair-step" pattern, with 128 added every 30 seconds:
- AIRFLOW__CORE__PARALLELISM=10000
- AIRFLOW__CORE__DAG_CONCURRENCY=10000
But if I set all three, it does add 10K to the queue in one shot.

Resources