Airflow task skips a day during backfill

This is a repost from a discussion in the Airflow repo:
https://github.com/apache/airflow/discussions/23808
I'm seeing unexpected behaviour when backfilling a DAG with the Airflow CLI. After running:
gcloud composer environments run [company environment] --location [location] dags backfill -- --reset-dagruns -s 2022-05-05 -e 2022-05-18 minimal_skip
I observe that 2022-05-05 is complete on the Airflow UI, but the next run executed is for 2022-05-07 rather than 2022-05-06 as expected.
import os
import datetime

import airflow
from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator

from airflow_dag_utilities import airflow_dag_utilities as dag_utils


def main():
    dag_args = {
        # DAG
        # department-project_name-dag_description (inherit from definition)
        "dag_id": os.path.splitext(os.path.basename(__file__))[0],
        "schedule_interval": "0 0 * * *",
        "max_active_runs": 1,
        "dagrun_timeout": datetime.timedelta(minutes=59),
        "template_searchpath": ["/home/airflow/gcs/dags/minimal_skip/"],
        "catchup": False,  # must include because airflow default catchup is True
        # Operators
        "default_args": {
            # 'start_date': datetime.datetime(2020, 1, 13, 15), (year, m, d, hour)
            "start_date": None,  # time: UTC, uses dag_utils.suggest_start_date
            "owner": "xxxxxxx",
            "email": ["xxxxxxxxxx#xxxxxxx.com"],
            "depends_on_past": False,  # new instance will not run if past job failed
            "retries": 1,
            "retry_delay": datetime.timedelta(seconds=30),
            "email_on_failure": True,
            "email_on_retry": False,
        },
    }
    dag_args["default_args"]["start_date"] = dag_utils.suggest_start_date(
        dag_args["schedule_interval"], intervals=-2
    )  # remove automatic start_date assignment when using your own
    create_dag(dag_args)


def create_dag(dag_args):
    with airflow.DAG(**dag_args) as dag:
        """
        No logic outside of tasks (operators) or they are constantly run by the scheduler
        """
        t_select_five = BigQueryInsertJobOperator(
            task_id='select_five',
            configuration={
                'query': {
                    'query': "SELECT 5;",
                    'useLegacySql': False,
                },
            },
        )
        t_select_ten = BigQueryInsertJobOperator(
            task_id='select_10',
            configuration={
                'query': {
                    'query': "SELECT 10",
                    'useLegacySql': False,
                },
            },
        )
        t_select_five >> t_select_ten

    globals()[dag.dag_id] = dag  # keep this


main()  # keep this
We see the same issue when depends_on_past=True.
Why does Airflow behave this way when backfilling? Is this expected behaviour? I found it unintuitive: given that there is an option to run the backfill 'backwards' sequentially, why wouldn't it run 'forwards' sequentially without skipping days?
Using the Airflow UI to look at the DAG runs:

Related

Apache airflow scheduler not triggering once in a month task as expected

Here is the DAG, which I want to execute on a fixed date every month; for now I have set it to the 18th of every month.
But the task is getting triggered daily by the scheduler. catchup_by_default is set to False in the airflow.cfg file.
default_args = {
    'owner': 'muddassir',
    'depends_on_past': True,
    'start_date': datetime(2021, 3, 18),
    'retries': 1,
    'schedule_interval': '0 0 18 * *',
}
You have put schedule_interval in your default_args, which is not the way to schedule a DAG. default_args are applied to the tasks: they are passed to the Operators, not to the DAG itself.
You can fix this by removing schedule_interval from default_args and passing it to the DAG instance instead; you can also set the catchup flag to False to avoid any backfills:
# assuming this is how you initialize your DAG
dag = DAG('your DAG UI name', default_args=default_args, schedule_interval = '0 0 18 * *', catchup=False)
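For completeness, a fuller sketch of that fix could look like the following (the DAG name and the placeholder task below are made up for illustration): schedule_interval and catchup live on the DAG, while default_args keeps only task-level settings.

from datetime import datetime

from airflow import DAG
from airflow.operators.dummy import DummyOperator

default_args = {
    'owner': 'muddassir',
    'depends_on_past': True,
    'start_date': datetime(2021, 3, 18),
    'retries': 1,
    # no schedule_interval here: it belongs on the DAG, not on the tasks
}

dag = DAG(
    'monthly_dag',                   # placeholder UI name
    default_args=default_args,
    schedule_interval='0 0 18 * *',  # midnight on the 18th of every month
    catchup=False,                   # do not backfill missed runs
)

# placeholder task, just so the sketch is complete
run_monthly_job = DummyOperator(task_id='run_monthly_job', dag=dag)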

Airflow dag not running when scheduled (while others scheduled at same time, do)?

I have 2 DAGs in Airflow, both of which are scheduled to run at 22:00 UTC (12 PM my time (HST)). I find that only one of these DAGs runs at this time, and I am not sure why that is happening. I can manually start the other DAG while the one that works is running, but it just does not start on its own.
Here are the DAG configs for the DAG that is running on schedule:
default_args = {
    'owner': 'me',
    'depends_on_past': False,
    'start_date': datetime(2019, 10, 13),
    'email': [
        'me#co.org'
    ],
    'email_on_failure': True,
    'retries': 0,
    'retry_delay': timedelta(minutes=5),
    'max_active_runs': 1,
}

dag = DAG('my_dag_1', default_args=default_args, catchup=False, schedule_interval="0 22 * * *")
Here are the DAG configs for the DAG that is failing to run:
default_args = {
    'owner': 'me',
    'depends_on_past': False,
    'start_date': datetime(2019, 10, 13),
    'email': [
        'me#co.org',
    ],
    'email_on_failure': True,
    'retries': 0,
    'retry_delay': timedelta(minutes=5),
}

dag = DAG('my_dag_2', default_args=default_args,
          max_active_runs=1,
          catchup=False, schedule_interval="0 19,22,1 * * *")
# run setup dag and trigger at 9AM, 12PM, and 3PM (need to convert from UTC time (-2 HST))
From the airflow.cfg file, some of the settings that I think are relevant are set as...
# The amount of parallelism as a setting to the executor. This defines
# the max number of task instances that should run simultaneously
# on this airflow installation
#parallelism = 32
parallelism = 8
# The number of task instances allowed to run concurrently by the scheduler
#dag_concurrency = 16
dag_concurrency = 3
# Are DAGs paused by default at creation
dags_are_paused_at_creation = True
# The maximum number of active DAG runs per DAG
#max_active_runs_per_dag = 16
max_active_runs_per_dag = 1
Not sure what could be going on here. Is there some setting that I am mistakenly switching on that stops multiple different DAGs from running at the same time? Is there any more debugging info I could add to this question?

Airflow Scheduling: how to run initial setup task only once?

If my DAG is this
[setup] -> [processing-task] -> [end].
How can I schedule this DAG to run periodically, while running the [setup] task only once (on the first scheduled run) and skipping it for all later runs?
Check out this post on Medium, which describes how to implement a "run once" operator. I have successfully used this approach several times.
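For readers who cannot open the link, here is a minimal illustrative sketch of the same idea (this is not the code from the post, and it assumes Airflow 2.2+, where TaskInstance has a run_id column): the setup callable skips itself whenever an earlier DAG run has already executed it successfully.

from datetime import datetime

from airflow import DAG
from airflow.exceptions import AirflowSkipException
from airflow.models import TaskInstance
from airflow.operators.python import PythonOperator
from airflow.utils.session import provide_session
from airflow.utils.state import State


@provide_session
def setup_once(session=None, **context):
    # Count successful instances of this task belonging to any other DAG run.
    already_done = (
        session.query(TaskInstance)
        .filter(
            TaskInstance.dag_id == context["dag"].dag_id,
            TaskInstance.task_id == context["task"].task_id,
            TaskInstance.run_id != context["run_id"],
            TaskInstance.state == State.SUCCESS,
        )
        .count()
    )
    if already_done:
        raise AirflowSkipException("setup already ran in an earlier DAG run")
    print("running one-time setup")  # the real setup work would go here


with DAG(
    dag_id="run_once_sketch",  # hypothetical dag_id
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    setup = PythonOperator(task_id="setup", python_callable=setup_once)

Note that any task downstream of setup would need a trigger rule such as none_failed, since setup ends up in the skipped state on every run after the first.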
Here is a way to do it without needing to create a new class. I found this simpler than the accepted answer, and it worked well for my use case.
It might be useful for others!
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.dummy import DummyOperator
from airflow.operators.python import BranchPythonOperator

with DAG(
    dag_id='your_dag_id',
    default_args={
        'depends_on_past': False,
        'email': ['you#email.com'],
        'email_on_failure': True,
        'email_on_retry': False,
        'retries': 1,
        'retry_delay': timedelta(minutes=5),
    },
    description='Dag with initial setup task that only runs on start_date',
    start_date=datetime(2000, 1, 1),
    # Runs daily at 1 am
    schedule_interval='0 1 * * *',
    # catchup must be true if start_date is before datetime.now()
    catchup=True,
    max_active_runs=1,
) as dag:

    def branch_fn(**kwargs):
        # Have to make sure start_date will equal data_interval_start on first run
        # This dag is daily, but since the schedule_interval is set to 1 am,
        # data_interval_start would be 2000-01-01 01:00:00 when it needs to be
        # 2000-01-01 00:00:00
        date = kwargs['data_interval_start'].replace(hour=0, minute=0, second=0, microsecond=0)
        if date == dag.start_date:
            return 'initial_task'
        else:
            return 'skip_initial_task'

    branch_task = BranchPythonOperator(
        task_id='branch_task',
        python_callable=branch_fn,
        provide_context=True
    )

    initial_task = DummyOperator(
        task_id="initial_task"
    )

    skip_initial_task = DummyOperator(
        task_id="skip_initial_task"
    )

    next_task = DummyOperator(
        task_id="next_task",
        # This is important otherwise next_task would be skipped
        trigger_rule="one_success"
    )

    branch_task >> [initial_task, skip_initial_task] >> next_task

Airflow ExternalTaskSensor Stuck and Error

I have defined the external_sensor like this:
external_sensor = ExternalTaskSensor(
    task_id='ext_sensor_task',
    execution_delta=timedelta(minutes=0),
    external_dag_id='book_data',
    external_task_id='Dataframe_Windows_test',
    dag=dag,
)
The other task is defined like this:
dl_processing_windows = DL_Processing(
    task_id='dl_processing_windows',
    df_dataset_location=dl_config.WINDOWS_DATASET,
    ....
Meanwhile, in the Airflow UI I got the error:
Argument ['task_id'] is required
I have two questions:
1. Why does this error occur?
2. Why does it not work?
The attachment:
default_args = {
    'owner': 'Newt',
    'retries': 2,
    'retry_delay': timedelta(seconds=30),
    'depends_on_past': False,
}

dag = DAG(
    dag_id,
    start_date=datetime(2019, 11, 20),
    description='xxxx',
    default_args=default_args,
    schedule_interval=timedelta(hours=1),
)
The DAG parameters are the same for both DAGs!
I fixed it.
Normally the start_date is set independently of the schedule_interval; here I set the start_date for both DAGs to the same time, using the current date.
After the first run of the DAG it depends on finished, the new DAG began to work!
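To illustrate what that alignment looks like, here is a minimal sketch (the import path assumes Airflow 2.x, and the downstream dag_id is made up; the sensor and task names are taken from the question): the sensor matches runs by execution date, so both DAGs share the same start_date and schedule_interval, and execution_delta can stay at zero.

from datetime import datetime, timedelta

from airflow import DAG
from airflow.sensors.external_task import ExternalTaskSensor

# both DAGs use an identical start_date and schedule, so their execution dates line up
common = dict(start_date=datetime(2019, 11, 20), schedule_interval=timedelta(hours=1))

upstream_dag = DAG('book_data', **common)              # the DAG that runs Dataframe_Windows_test
downstream_dag = DAG('windows_processing', **common)   # hypothetical dag_id for the waiting DAG

external_sensor = ExternalTaskSensor(
    task_id='ext_sensor_task',
    external_dag_id='book_data',
    external_task_id='Dataframe_Windows_test',
    # zero delta: the sensed run has exactly the same execution date as this run
    execution_delta=timedelta(minutes=0),
    dag=downstream_dag,
)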

airflow does not satisfy task dependencies

I have a simple Airflow workflow composed of two tasks. One downloads a CSV file containing stock data. The other extracts the maximum stock price and writes the data to another file.
If I run the first task and then the second, everything works fine; however, if I execute airflow run stocks_d get_max_share, it fails to meet the dependency.
import csv
from datetime import datetime
from datetime import timedelta

import requests

from airflow import DAG
from airflow.operators.python_operator import PythonOperator


def get_stock_data():
    url = "https://app.quotemedia.com/quotetools/getHistoryDownload.csv?&webmasterId=501&startDay=02&startMonth=02&startYear=2002&endDay=02&endMonth=07&endYear=2009&isRanged=false&symbol=APL"
    try:
        r = requests.get(url)
    except requests.RequestException as re:
        raise
    else:
        with open('/tmp/stocks/airflow_stock_data.txt', 'w') as f:
            f.write(r.text)


def get_max_share():
    stock_data = []
    stock_max = {}
    with open('/tmp/stocks/airflow_stock_data.txt', 'r') as f:
        stock_reader = csv.reader(f)
        next(stock_reader, None)
        for row in stock_reader:
            stock_data.append(row)
    for stock in stock_data:
        stock_max[stock[2]] = stock[0]
    with open('/tmp/stocks/max_stock', 'w') as f:
        stock_price = max(stock_max.keys())
        stock_max_price_date = stock_max[stock_price]
        stock_entry = stock_max_price_date + ' -> ' + stock_price
        f.write(stock_entry)


default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2017, 5, 30),
    'email': ['mainl#domain.io'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
    'catchup': False,
}

dag = DAG('stocks_d', default_args=default_args, schedule_interval=timedelta(minutes=5))

task_get_stocks = PythonOperator(task_id='get_stocks', python_callable=get_stock_data, dag=dag)
task_get_max_share = PythonOperator(task_id='get_max_share', python_callable=get_max_share, dag=dag)

task_get_max_share.set_upstream(task_get_stocks)
Any ideas why that happens?
$ airflow run stocks_d get_max_share
The above command only runs the get_max_share task; it does not run the upstream task first.
If you need to run the whole DAG, try the command below:
$ airflow trigger_dag stocks_d
