I have an Airflow Http sensor that calls a REST endpoint and checks for a specific value in the JSON structure returned by the API
sensor = HttpSensor(
    soft_fail=True,
    task_id='http_sensor_check',
    http_conn_id='http_default',
    endpoint='http://localhost:8082/api/v1/resources/games/all',
    request_params={},
    response_check=lambda response: True if check_api_response(response) is True else False,
    mode='reschedule',
    dag=dag)
If the response_check is false, the DAG is put in an "up_for_reschedule" state. The issue is that it stays in that status forever and never gets rescheduled.
My questions are:
What does "up_for_reschedule" mean, and when will the DAG be rescheduled?
Suppose my DAG is scheduled to run every 5 minutes, but because of the sensor the "up_for_reschedule" DAG run overlaps with the new run. Will I have 2 DAGs running at the same time?
Thank you in advance.
In sensors, mode='reschedule' means that if the sensor's criteria isn't met, the sensor releases the worker slot to other tasks. This is very useful for cases where a sensor may wait for a long time.
up_for_reschedule means that the sensor condition isn't true yet and it hasn't reached its timeout, so the task is waiting to be rescheduled by the scheduler.
You don't know exactly when the task will run next; that depends on the scheduler (available resources, priorities, etc.). If you don't want to allow parallel DAG runs, set max_active_runs=1 in the DAG constructor, as in the sketch below.
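For example, a minimal sketch (with a hypothetical dag_id, the schedule from your question, and illustrative dates) of a DAG that can never have overlapping runs while its sensor is up_for_reschedule:

from datetime import datetime, timedelta
from airflow import DAG

dag = DAG(
    dag_id='http_check_example',             # hypothetical name
    start_date=datetime(2021, 1, 1),
    schedule_interval=timedelta(minutes=5),
    max_active_runs=1,                       # at most one active DAG run at a time
)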
Side note:
response_check=lambda response: True if check_api_response(response) is True else False,
is the same as:
response_check=lambda response: check_api_response(response),
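And since check_api_response already takes the response and returns a boolean, you could pass the callable directly: response_check=check_api_response.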
Related
In our orchestration we use one DAG to trigger other child DAGs, and until these child DAGs finish, our master DAG keeps running to check their status. We used a sleep of 5 minutes to check the status of the child DAGs every 5 minutes. As this task is continually in a running state, it consumes resources on one worker. We recently came to know about up_for_reschedule. Does this solve my problem of releasing the worker? Is it possible to use up_for_reschedule with the Python operator? If yes, is there any document I can refer to?
There are a couple of options here.
You can use a PythonSensor instead of an operator, in reschedule mode. In reschedule mode, Airflow will reschedule the task instance if the sensor condition is not yet met, releasing the worker slot in the meantime.
from airflow.sensors.python import PythonSensor

task = PythonSensor(
    task_id='sensor_example',
    mode='reschedule',
    python_callable=func  # func should return True once the condition is met
)
You can chain a bunch of TriggerDagRunOperators. But this is best if you just want to create the DAG runs and not wait for status checks. Unlike sensors, it doesn't have a reschedule mode, so when wait_for_completion is True it will hold the worker slot while it waits. A minimal sketch follows.
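For illustration, a rough sketch (Airflow 2.x import path, hypothetical child DAG ids, assuming a dag object as in the snippet above) of chaining two triggers in fire-and-forget mode:

from airflow.operators.trigger_dagrun import TriggerDagRunOperator

trigger_child_a = TriggerDagRunOperator(
    task_id='trigger_child_a',
    trigger_dag_id='child_dag_a',   # hypothetical child DAG id
    wait_for_completion=False,      # fire-and-forget; True would occupy the worker slot
    dag=dag,
)
trigger_child_b = TriggerDagRunOperator(
    task_id='trigger_child_b',
    trigger_dag_id='child_dag_b',
    wait_for_completion=False,
    dag=dag,
)
trigger_child_a >> trigger_child_b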
I would like to trigger an Airflow DAG based on SQS messages. I am quite new to Airflow, but this is how I think it should be done:
Option 1
Use the Airflow SQS Sensor. From my understanding, this waits on SQS messages to proceed with the execution of an already triggered DAG. Does this mean a DAG would always need to be running and waiting for SQS messages to catch any eventual new messages and process them? Does this also mean I should schedule my DAG on a very short interval, so that when an SQS message gets handled by one DAG run, another run is created to handle the next SQS messages?
Option 2
Add a lambda or something watching for SQS messages and using the Airflow API to trigger DAGs when needed.
Eventually, I would like to minimise the number of interactions needed to trigger a DAG so I would like to use an Airflow built-in way of watching SQS.
Thank you
Both options are valid; however, Option 2 is basically an alternative implementation of a sensor. I think the better solution is Option 1 with some modification:
Use SQSSensor but with mode='reschedule'. That way, every once in a while the sensor "wakes up" and checks whether the criteria is met. Note that this is not like sleep(x): when the criteria isn't met, Airflow releases the worker for other tasks that need to run and returns the SQSSensor to the scheduling queue.
You can read more about the sensor modes in the docs.
from airflow.providers.amazon.aws.sensors.sqs import SQSSensor

SQSSensor(
    task_id='test_task',
    dag=dag,
    sqs_queue='your_queue',
    aws_conn_id='aws_default',
    mode='reschedule')
Note that the sensor will run indefinitely until the criteria is met. You can set a timeout on the sensor task (there are other possible reasons for timeout, like cluster policy and other defaults, but that is another topic); see the sketch below.
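For example, a minimal sketch (values are illustrative) that bounds how long the sensor may wait, and optionally skips instead of failing when the timeout is reached:

SQSSensor(
    task_id='test_task',
    dag=dag,
    sqs_queue='your_queue',
    aws_conn_id='aws_default',
    mode='reschedule',
    timeout=60 * 60 * 6,   # give up after 6 hours (the default is 7 days)
    soft_fail=True)        # mark the task as skipped instead of failed on timeout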
Facing a scenario in Apache Airflow where the ask is to stop the DAG execution on the event of an SLA miss and proceed with the subsequent DAG interval's execution.
Is such a functionality configurable in Airflow ?
You can create a lightweight DAG that runs every X (the smallest interval within your Airflow cluster) which checks for SLA misses and pauses the offending DAG if there are any; a rough sketch follows.
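A rough, untested sketch (Airflow 2.x import paths; the function name is made up) of what that check task's callable could look like, using Airflow's SlaMiss and DagModel metadata models:

from airflow.models import DagModel, SlaMiss
from airflow.utils.session import provide_session

@provide_session
def pause_dags_with_sla_miss(session=None):
    # Find every dag_id that has at least one recorded SLA miss
    missed = {row.dag_id for row in session.query(SlaMiss.dag_id).distinct()}
    for dag_id in missed:
        dag_model = session.query(DagModel).filter(DagModel.dag_id == dag_id).one_or_none()
        if dag_model and not dag_model.is_paused:
            dag_model.is_paused = True  # same effect as toggling the DAG off in the UI
    session.commit()

You would wrap this in a PythonOperator inside the lightweight DAG. As written it pauses any DAG with a recorded SLA miss, so in practice you would probably filter by dag_id or by the miss timestamp.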
I have a DAG which takes a very long time to do a BigQuery operation, and I always get the error 'Broken DAG: [/home/airflow/gcs/dags/xyz.py] Timeout'.
I found some answers saying that we have to increase the timeout in airflow.cfg, but that approach is not suitable for my project. Is it possible to somehow increase the timeout for a particular DAG? Can anybody please help? Thank you.
Yes, you can set the dagrun_timeout parameter on the DAG (there is a sketch below, after the quoted docs).
Specify how long a DagRun should be up before
timing out / failing, so that new DagRuns can be created. The timeout
is only enforced for scheduled DagRuns, and only once the
# of active DagRuns == max_active_runs.
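A minimal sketch (illustrative dag_id, dates, and timeout value) of setting it on the DAG:

from datetime import datetime, timedelta
from airflow import DAG

dag = DAG(
    dag_id='xyz',                        # your DAG
    start_date=datetime(2021, 1, 1),
    schedule_interval='@daily',
    dagrun_timeout=timedelta(hours=2),   # fail the DagRun if it runs longer than 2 hours
)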
We also have a parameter execution_timeout on each Task that you can set.
execution_timeout: max time allowed for the execution of
this task instance, if it goes beyond it will raise and fail.
:type execution_timeout: datetime.timedelta
So if one of the tasks is running a query on BigQuery, you can use something like:
import datetime
from airflow.contrib.operators.bigquery_operator import BigQueryOperator  # Airflow 1.x import path

BigQueryOperator(sql=sql,
                 destination_dataset_table='{{ params.t_name }}',
                 task_id='bq_query',
                 bigquery_conn_id='my_bq_connection',
                 use_legacy_sql=False,
                 write_disposition='WRITE_TRUNCATE',
                 create_disposition='CREATE_IF_NEEDED',
                 params={'t_name': table_name},  # feeds the {{ params.t_name }} template
                 execution_timeout=datetime.timedelta(minutes=10),
                 dag=dag)
I've been setting up an airflow cluster on our system and previously it has been working. I'm not sure what I may have done to change this.
I have a DAG which I want to run on a schedule. To make sure it's working I'd also like to trigger it manually. Neither of these seem to be working at the moment and no logs seem to be being written for the task instances. The only logs available are the airflow scheduler logs which generally look healthy.
I am just constantly met with this message:
Task is not ready for retry yet but will be retried automatically. Current date is 2018-12-12T11:34:46.978355+00:00 and task will be retried at 2018-12-12T11:35:08.093313+00:00.
However, if I wait a little the exact same message is presented again except the times have moved forward a little. Therefore, it seems the task is never actually being retried.
I am using a LocalExecutor and the task is an SSHOperator. Simplified code below. All it does is SSH onto another machine and start a bunch of applications with a pre-determined directory structure:
# Imports assumed for this simplified snippet (Airflow 1.10-era paths);
# ConfigDbClient is a project-specific helper, its import is omitted.
import json
import re
from datetime import datetime, timedelta

from airflow import DAG
from airflow.contrib.hooks.ssh_hook import SSHHook
from airflow.contrib.operators.ssh_operator import SSHOperator
from airflow.utils import timezone

DB_INFO_FILE = 'info.json'
START_SCRIPT = '/bin/start.sh'
TIME_IN_PAST = timezone.convert_to_utc(datetime.today() - timedelta(days=1))

DEFAULT_ARGS = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': TIME_IN_PAST,
    'email': ['some_email#blah.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=1),
}

def _extract_instance_id(instance_string):
    return re.findall(r'\d+', instance_string)[0]

def _read_file_as_json(file_name):
    with open(file_name) as open_file:
        return json.load(open_file)

DB_INFO = _read_file_as_json(DB_INFO_FILE)
CONFIG_CLIENT = ConfigDbClient(**DB_INFO)
APP_DIRS = CONFIG_CLIENT.get_values('%my-app-info%')

INSTANCE_START_SCRIPT_PATHS = {
    _extract_instance_id(instance_string): directory + START_SCRIPT
    for instance_string, directory in APP_DIRS.items()
}

# Create an ssh hook which refers to pre-existing connection information
# set up and stored by airflow
SSH_HOOK = SSHHook(ssh_conn_id='my-conn-id')

# Create a DAG object to add tasks to
DAG = DAG('my-dag-id',
          default_args=DEFAULT_ARGS)

# Create a task for each app instance.
for instance_id, start_script in INSTANCE_START_SCRIPT_PATHS.items():
    task = SSHOperator(
        task_id='run-script-{0}'.format(instance_id),
        command='bash {0}'.format(start_script),
        ssh_hook=SSH_HOOK,
        dag=DAG)
It works when I run the tasks individually via the command line, but not via the UI. It seems I can run tasks, but I simply cannot trigger a DAG to run. I've also tried many combinations of start_dates and schedule intervals just to sanity check.
Any ideas?
And yes, I am aware this question has been asked before and I have looked at all of them, but none of the solutions have helped me.
Oh. Your start_date is changing at the same rate or faster than the schedule interval period.
Here's what the scheduler is seeing every couple of seconds:
start_date: 2018-12-11T12:12:12.000Z # E.G. IFF now is 2018-12-12T12:12:12.000Z, a day ago
schedule_interval: timedelta(days=1) # The default
Here's what the scheduler needs for a DAG to run: The last time a run occurred was more than one schedule interval ago. If no scheduled run has occurred, the first scheduled run could start now if one full schedule interval has passed since the start_date as that is the earliest allowable date for execution_date. In which case the dag_run with the execution_date set to the start of that interval period should be created. Then task_instances can be created for any tasks in the DAG whose dependencies are met as long as the task_instance execution_date is after the start_date of the DAG (this is not stored on the dag_run object but recomputed by loading the DAG file just while inspecting the dag's state).
So it's not getting scheduled automatically because the start date keeps moving forward just as the interval is about to be satisfied. However, if the start_date were fixed at, say, 2 days ago, at least one run would get scheduled, and any further runs would then have to wait until 1 day after that. It's easier if you just set a fixed datetime on your start_date.
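For example, a minimal change (illustrative date) that pins the start_date instead of recomputing it on every DAG file parse:

from datetime import datetime

DEFAULT_ARGS = {
    # ... other args unchanged ...
    'start_date': datetime(2018, 12, 1),  # fixed, so the scheduler sees a stable interval
}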
But what about those odd retries on your manual runs…
You did start a manual run or two. These runs took the current time as the execution_date unless you specified something else. This should be after the start_date, at least until tomorrow, which should clear them to run. But then it seems in your logs you're seeing that they're failing and being marked for retry, and also not decrementing your retries. I'm not sure why that would be, but could it be that something isn't right with the SSHOperator?
Did you install airflow with the [ssh] extra so that SSHOperator's dependencies (specifically paramiko and sshtunnel) are met on both the webserver and scheduler? One of them is working, because I assume the DAG is being parsed and showing up in the UI after being added to the DB.
What do you get if you execute:
airflow test my-dag-id run-script-an_instance_id 2018-12-12T12:12:12
You know that the scheduler and webserver are looping over refilling their DAG bag, and so rerunning this DAG file a few thousand times a day, reloading that JSON (it's local access, so similar to importing a module), and recreating that SSHHook with a DB lookup. I don't see anything fancy being done setting up this hook, so why not just remove ssh_hook from the SSHOperator and replace it with ssh_conn_id='my-conn-id' so the hook can be created once at execution time? For example:
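A minimal sketch of that change, keeping the same task loop as in your question:

for instance_id, start_script in INSTANCE_START_SCRIPT_PATHS.items():
    task = SSHOperator(
        task_id='run-script-{0}'.format(instance_id),
        command='bash {0}'.format(start_script),
        ssh_conn_id='my-conn-id',  # hook is built at execution time, not at every parse
        dag=DAG)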
I doubt that's the issue that's causing the retries that just roll forward though.
I had a task stuck in up_for_retry for almost 24 hours before I noticed it, and it had nothing to do with the start_date, end_date, or any other classic beginner's problem.
I resorted to reading the source code, and found that Airflow treats up_for_retry tasks differently if they are part of a backfill DAG run.
So I connected to the metadata DB and changed the run_id prefix from backfill_ to scheduled__ in the dag_run row corresponding to the stuck task. Airflow started running the stuck task immediately.
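For reference, a hypothetical sketch of that manual fix using Airflow's own models instead of raw SQL (the function name is made up, adjust the filter to your stuck run; this is not an official procedure):

from airflow.models import DagRun
from airflow.utils.session import provide_session

@provide_session
def mark_run_as_scheduled(dag_id, old_run_id, session=None):
    run = session.query(DagRun).filter(
        DagRun.dag_id == dag_id,
        DagRun.run_id == old_run_id).one()
    # Rename the run so Airflow no longer treats it as part of a backfill
    run.run_id = old_run_id.replace('backfill_', 'scheduled__')
    session.commit()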