I have a DAG which takes a very long time to do a BigQuery operation, and I always get the error 'Broken DAG: [/home/airflow/gcs/dags/xyz.py] Timeout'.
I found some answers saying that we have to increase the timeout in airflow.cfg, but that approach is not suitable for my project. Is it possible to increase the timeout for a particular DAG? Any help is appreciated. Thank you.
Yes, you can set the dagrun_timeout parameter on the DAG. From the docs:
Specify how long a DagRun should be up before timing out / failing, so that new DagRuns can be created. The timeout is only enforced for scheduled DagRuns, and only once the # of active DagRuns == max_active_runs.
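For example, a minimal sketch (the DAG id, schedule, and the two-hour limit are placeholders, not from your project):
from datetime import datetime, timedelta
from airflow import DAG

dag = DAG(
    'bq_long_running',                  # placeholder DAG id
    start_date=datetime(2019, 1, 1),
    schedule_interval='@daily',
    # Fail the DagRun (freeing the slot for new runs) if it exceeds 2 hours.
    dagrun_timeout=timedelta(hours=2),
)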
There is also an execution_timeout parameter that you can set on each task:
execution_timeout: max time allowed for the execution of this task instance, if it goes beyond it will raise and fail.
:type execution_timeout: datetime.timedelta
So if one of the tasks runs a query on BigQuery you can use something like:
BigQueryOperator(
    task_id='bq_query',
    sql=sql,
    destination_dataset_table='{{ params.t_name }}',
    bigquery_conn_id='my_bq_connection',
    use_legacy_sql=False,
    write_disposition='WRITE_TRUNCATE',
    create_disposition='CREATE_IF_NEEDED',
    params={'t_name': table_name},  # available in templates as params.t_name
    execution_timeout=datetime.timedelta(minutes=10),
    dag=dag)
Related
We have an ETL DAG which is executed daily. DAG and tasks have the following parameters:
catchup=False
max_active_runs=1
depends_on_past=True
When we add a new task, no new DAG runs get scheduled because of the depends_on_past property, as all previous states for the new task are missing.
We would like to avoid running a manual backfill or manually marking previous runs from the UI, as it can easily be forgotten; we also have some dynamic DAGs where tasks get added automatically, which then halt future DAG executions.
Is there a way to automatically set past executions for new tasks as skipped by default, or some other approach that will allow future DAG runs to execute without human intervention?
We also considered creating a maintenance DAG that would insert missing task executions with skipped state, but would rather not go this route.
Are we missing something as the flow looks like a common thing to do?
Defined in Airflow documentation on BaseOperator:
depends_on_past (bool) – when set to true, task instances will run
sequentially and only if the previous instance has succeeded or has
been skipped. The task instance for the start_date is allowed to run.
As long as there exists a previous instance of the task, if that previous instance is not in the success state, the current instance of the task cannot run.
When you add a task to a DAG that has existing dagruns, Airflow will create the missing task instances in the None state for all dagruns. Unfortunately, it is not possible to set the default state of task instances.
I do not believe there is a way to allow future task instances of a DAG with existing dagruns to run without human intervention. Personally, for tasks with depends_on_past enabled, I mark the previous task instance as success either through the CLI or the Airflow UI.
Looks like there is a GitHub issue describing exactly what you are experiencing! Feel free to bump this PR or take a stab at it if you would like.
A hacky solution is to set depends_on_past to False, since max_active_runs=1 will implicitly guarantee the same behavior. As of the current Airflow version, the scheduler orders both dag runs and task instances by execution date before running them (checked on 1.10.x and also 2.0).
Another difference is that the next execution will still be scheduled even if the previous one fails. We solved this by retrying an unlimited number of times (setting a very large retries value) and alerting if the retry number grows larger than some threshold.
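Put together, the workaround might look something like this (the DAG id, schedule, and retry numbers are illustrative, not from the question):
from datetime import datetime, timedelta
from airflow import DAG

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,   # new tasks are no longer blocked by missing past states
    'retries': 1000,            # effectively "retry forever"
    'retry_delay': timedelta(minutes=15),
}

dag = DAG(
    'daily_etl',                # placeholder DAG id
    start_date=datetime(2019, 1, 1),
    schedule_interval='@daily',
    catchup=False,
    max_active_runs=1,          # serializes runs, mimicking the depends_on_past ordering
    default_args=default_args,
)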
Airflow: Is it possible to configure a task-level timeout in a DAG?
I want to prevent a task from running indefinitely. As I understand it, the sla parameter only comes into play once the task has completed and overshot the SLA.
For timeouts on Operators in Airflow you can add the execution_timeout parameter. From the docs:
execution_timeout (datetime.timedelta) – max time allowed for the execution of this task instance, if it goes beyond it will raise and fail
It expects a datetime.timedelta, e.g. timedelta(hours=1) for a max of 1 hour for the task.
Note that for sensors execution_timeout does not work. Sensors expect a timeout parameter instead.
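For example, a minimal sketch (the DAG id, task ids, command, and file path are placeholders; the import paths are the Airflow 1.10 ones):
from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.contrib.sensors.file_sensor import FileSensor

dag = DAG(
    'timeout_example',                      # placeholder DAG id
    start_date=datetime(2019, 1, 1),
    schedule_interval='@daily',
)

# Regular operator: execution_timeout fails the task instance after 1 hour.
long_job = BashOperator(
    task_id='long_job',
    bash_command='sleep 30',                # placeholder command
    execution_timeout=timedelta(hours=1),
    dag=dag,
)

# Sensor: use timeout (in seconds) instead of execution_timeout.
wait_for_file = FileSensor(
    task_id='wait_for_file',
    filepath='/tmp/ready.flag',             # placeholder path
    poke_interval=60,
    timeout=60 * 60,                        # give up after 1 hour of poking
    dag=dag,
)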
Is there a way to specify that a task can only run once concurrently? So in the tree above, where DAG concurrency is 4, Airflow would start task 4 instead of a second instance of task 2?
This DAG is a little special because there is no ordering between the tasks. The tasks are independent but related in purpose, and are therefore kept in one DAG so as not to create an excessive number of single-task DAGs.
max_active_runs is 2 and dag_concurrency is 4. I would like it to start all 4 tasks and only start a task in the next run if the same task in the previous run is done.
I may have misunderstood your question, but I believe you want all the tasks in a single dagrun to finish before the tasks begin in the next dagrun, i.e. a DAG should only execute once the previous execution is complete.
If that is the case, you can make use of the max_active_runs parameter of the dag to limit how many running concurrent instances of a DAG there are allowed to be.
More information here (refer to the last dotpoint): https://airflow.apache.org/faq.html#why-isn-t-my-task-getting-scheduled
max_active_runs defines how many running concurrent instances of a DAG there are allowed to be.
The Airflow operator documentation describes the task_concurrency argument. Just set it to one.
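A sketch of how this fits together (the DAG and task ids are placeholders, not from the question):
from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

dag = DAG(
    'independent_tasks',        # placeholder DAG id
    start_date=datetime(2019, 1, 1),
    schedule_interval=timedelta(hours=1),
    max_active_runs=2,          # two DAG runs may be active at once
    concurrency=4,              # at most four task instances of this DAG at once
)

for i in range(1, 5):
    DummyOperator(
        task_id='task_{0}'.format(i),
        task_concurrency=1,     # at most one running instance of this task across runs
        dag=dag,
    )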
From the official docs for trigger rules:
depends_on_past (boolean) when set to True, keeps a task from getting triggered if the previous schedule for the task hasn’t succeeded.
So the future DAGs will wait for the previous ones to finish successfully before executing.
In airflow.cfg under [core] you will find:
dag_concurrency = 16
This is the number of task instances allowed to run concurrently by the scheduler; you're free to change it to whatever you desire.
I've been setting up an airflow cluster on our system and previously it has been working. I'm not sure what I may have done to change this.
I have a DAG which I want to run on a schedule. To make sure it's working I'd also like to trigger it manually. Neither of these seem to be working at the moment and no logs seem to be being written for the task instances. The only logs available are the airflow scheduler logs which generally look healthy.
I am just constantly met with this message:
Task is not ready for retry yet but will be retried automatically. Current date is 2018-12-12T11:34:46.978355+00:00 and task will be retried at 2018-12-12T11:35:08.093313+00:00.
However, if I wait a little the exact same message is presented again except the times have moved forward a little. Therefore, it seems the task is never actually being retried.
I am using a LocalExecutor and the task is an SSHOperator. Simplified code below. All it does is ssh onto another machine and start a bunch of applications with a pre-determined directory structure:
import json
import re
from datetime import datetime, timedelta

from airflow import DAG
from airflow.contrib.hooks.ssh_hook import SSHHook
from airflow.contrib.operators.ssh_operator import SSHOperator
from airflow.utils import timezone

DB_INFO_FILE = 'info.json'
START_SCRIPT = '/bin/start.sh'

TIME_IN_PAST = timezone.convert_to_utc(datetime.today() - timedelta(days=1))

DEFAULT_ARGS = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': TIME_IN_PAST,
    'email': ['some_email@blah.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=1),
}

def _extract_instance_id(instance_string):
    return re.findall(r'\d+', instance_string)[0]

def _read_file_as_json(file_name):
    with open(file_name) as open_file:
        return json.load(open_file)

DB_INFO = _read_file_as_json(DB_INFO_FILE)
CONFIG_CLIENT = ConfigDbClient(**DB_INFO)  # our own config client; import omitted here
APP_DIRS = CONFIG_CLIENT.get_values('%my-app-info%')

INSTANCE_START_SCRIPT_PATHS = {
    _extract_instance_id(instance_string): directory + START_SCRIPT
    for instance_string, directory in APP_DIRS.items()
}

# Create an ssh hook which refers to pre-existing connection information
# setup and stored by airflow
SSH_HOOK = SSHHook(ssh_conn_id='my-conn-id')

# Create a DAG object to add tasks to
DAG = DAG('my-dag-id',
          default_args=DEFAULT_ARGS)

# Create a task for each app instance.
for instance_id, start_script in INSTANCE_START_SCRIPT_PATHS.items():
    task = SSHOperator(
        task_id='run-script-{0}'.format(instance_id),
        command='bash {0}'.format(start_script),
        ssh_hook=SSH_HOOK,
        dag=DAG)
It works when I run the tasks individually via the command line, but not via the UI. It seems I can run tasks, but I simply cannot trigger a DAG to run. I've tried many combinations of start_date values and schedule intervals just to sanity check as well.
Any ideas?
And yes, I am aware this question has been asked before and I have looked at all of them, but none of the solutions have helped me.
Oh. Your start_date is changing at the same rate or faster than the schedule interval period.
Here's what the scheduler is seeing every couple of seconds:
start_date: 2018-12-11T12:12:12.000Z # E.G. IFF now is 2018-12-12T12:12:12.000Z, a day ago
schedule_interval: timedelta(days=1) # The default
Here's what the scheduler needs for a DAG to run: the last run must have occurred more than one schedule interval ago. If no scheduled run has occurred yet, the first scheduled run can start once one full schedule interval has passed since the start_date, because that is the earliest allowable execution_date. In that case a dag_run with its execution_date set to the start of that interval period is created. Then task_instances can be created for any tasks in the DAG whose dependencies are met, as long as the task_instance's execution_date is after the start_date of the DAG (this is not stored on the dag_run object but recomputed by loading the DAG file while inspecting the dag's state).
So it's not being scheduled automatically because the start_date keeps moving forward just as the interval is satisfied. If it were -2d instead, at least one run would get scheduled, and any further runs would have to wait until 1d after that to be scheduled. It's easier, though, if you just set a fixed datetime for your start_date.
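For instance, a sketch of that fix (the date itself is just an example):
from datetime import datetime, timedelta
from airflow import DAG

DEFAULT_ARGS = {
    'owner': 'airflow',
    'retries': 1,
    'retry_delay': timedelta(minutes=1),
}

# A fixed start_date, instead of "now minus one day" recomputed on every parse,
# gives the scheduler a stable interval to schedule against.
dag = DAG(
    'my-dag-id',                          # id from the question
    start_date=datetime(2018, 12, 1),     # any fixed date in the past
    schedule_interval=timedelta(days=1),
    default_args=DEFAULT_ARGS,
)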
But what about those odd retries on your manual runs…
You did start a manual run or two. These runs took the current time as the execution_date unless you specified something else. That should be after the start_date, at least until tomorrow, which should clear them to run. But in your logs you're seeing that they're failing and being marked for retry, and also not decrementing your retries. I'm not sure why that would be, but it could be that something isn't right with the SSHOperator.
Did you install airflow with the [ssh] extra so that the SSHOperator's dependencies (specifically paramiko and sshtunnel) are met on both the webserver and scheduler? One of them is working, because I assume the DAG is being parsed and showing up in the UI after being added to the DB.
What do you get if you execute:
airflow test my-dag-id run-script-an_instance_id 2018-12-12T12:12:12
Keep in mind that the scheduler and webserver are looping over refilling their DAG bag, so this DAG file is re-run a few thousand times a day, reloading that JSON (it's local access, so similar to importing a module) and recreating that SSHHook with a DB lookup. I don't see anything fancy being done to set up this hook, so why not just remove ssh_hook from the SSHOperator and replace it with ssh_conn_id='my-conn-id', so the hook can be created once at execution time?
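That change would look roughly like this (a self-contained sketch; the import path is the Airflow 1.10 one, and the ids and path are illustrative):
from datetime import datetime
from airflow import DAG
from airflow.contrib.operators.ssh_operator import SSHOperator

dag = DAG('ssh-conn-id-example', start_date=datetime(2018, 12, 1))

# The operator builds the SSH hook from the stored connection at execution
# time, so nothing heavy happens while the file is being parsed.
run_script = SSHOperator(
    task_id='run-script-1',                        # illustrative task id
    command='bash /apps/instance-1/bin/start.sh',  # illustrative path
    ssh_conn_id='my-conn-id',                      # connection id from the question
    dag=dag,
)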
I doubt that's the issue that's causing the retries that just roll forward though.
I had a task stuck in up_for_retry for almost 24 hours before I noticed it, and it had nothing to do with the start_date, end_date, or any other classic beginner's problem.
I resorted to reading the source code and found that Airflow treats up_for_retry tasks differently if they are part of a backfill DAG run.
So I connected to the metadata DB and changed the run_id prefix from backfill_ to scheduled__ in the dag_run row corresponding to the stuck task. Airflow started running the stuck task immediately.
Can I externally (e.g. via an HTTP request) mark a specific task_id, associated with a dag_id and run_id, as success/failure?
My task is a long-running task on an external system and I don't want my task to poll the system for its status, since we can have several thousand tasks running at the same time.
Ideally I want my task to:
make an HTTP request to start my external job
go to sleep
once the job is finished, have it (the external system, or the post-build action of my job) inform Airflow that the task is done (identified by task_id, dag_id and run_id)
Thanks
You can solve this by sending SQL queries directly into Airflow's metadata DB:
UPDATE task_instance
SET state = 'success',
try_number = 0
WHERE
task_id = 'YOUR-TASK-ID'
AND
dag_id = 'YOUR-DAG-ID'
AND
execution_date = '2019-06-27T16:56:17.789842+00:00';
Notes:
The execution_date filter is crucial: Airflow identifies DagRuns by execution_date, not really by their run_id. This means you really need your DagRun's execution/run date to make it work.
The try_number = 0 part is there because Airflow will sometimes reset the task back to failed if it notices that try_number is already at its limit (max_tries).
You can see it in Airflow's source code here: https://github.com/apache/airflow/blob/750cb7a1a08a71b63af4ea787ae29a99cfe0a8d9/airflow/models/dagrun.py#L203
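If you would rather do this from Python than from a SQL client, here is a sketch (the connection string is a placeholder; use whatever sql_alchemy_conn in airflow.cfg points at):
from sqlalchemy import create_engine, text

# Placeholder connection string; point it at your Airflow metadata DB.
engine = create_engine('postgresql+psycopg2://airflow:airflow@localhost/airflow')

with engine.begin() as conn:
    conn.execute(
        text("""
            UPDATE task_instance
            SET state = 'success', try_number = 0
            WHERE task_id = :task_id
              AND dag_id = :dag_id
              AND execution_date = :execution_date
        """),
        {
            'task_id': 'YOUR-TASK-ID',
            'dag_id': 'YOUR-DAG-ID',
            'execution_date': '2019-06-27T16:56:17.789842+00:00',
        },
    )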
Airflow doesn't yet have a REST endpoint for this. However, you have a couple of options:
- Use the Airflow command line utilities to mark the job as success, e.g. from Python using Popen (see the sketch after this list).
- Directly update the Airflow DB table task_instance.
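A sketch of the first option (this assumes your Airflow version's airflow run command supports the --mark_success flag; the ids and date are placeholders):
import subprocess

def mark_task_success(dag_id, task_id, execution_date):
    """Mark a single task instance as succeeded via the Airflow CLI."""
    cmd = [
        'airflow', 'run', dag_id, task_id, execution_date,
        '--mark_success',   # record success without actually executing the task
    ]
    return subprocess.Popen(cmd).wait()

mark_task_success('YOUR-DAG-ID', 'YOUR-TASK-ID', '2019-06-27T16:56:17.789842+00:00')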