Airflow skips one scheduled run - airflow

I have several DAGs scheduled, but one particular run of one DAG is never triggered.
I am aware that Airflow runs a job at the end of the period, but surely I'm missing something.
I have a schedule defined as:
10 2,5,8,11,14,17,20,23 * * *, meaning my job should run every day at 02:10, 05:10, 08:10, 11:10, 14:10, 17:10, 20:10, and 23:10 UTC.
For some reason, the 23:10 UTC run is always skipped, and I don't understand why.
Airflow runs my 20:10 run, skips 23:10, and then continues with 02:10.
So my question is why this run is always skipped.
My default DAG arguments are as follows:
from datetime import timedelta
from airflow import DAG
from airflow.utils.dates import days_ago
default_args = {
    "owner": "whir",
    "depends_on_past": False,
    # Evaluated every time the file is parsed: midnight of the current day.
    "start_date": days_ago(0, hour=0, minute=0, second=0, microsecond=0),
    "email": [""],
    "email_on_failure": False,
    "email_on_retry": False,
    "retries": 4,
    "retry_delay": timedelta(minutes=30),
}
with DAG(
    'transfer-data',
    default_args=default_args,
    description="Transfer data",
    schedule_interval='10 2,5,8,11,14,17,20,23 * * *',
    catchup=True
) as dag:
    ...

OK, my guess is that the problem is your start_date parameter, which should be in the DAG definition rather than in default_args. Move it out of default_args and add it to your DAG definition like this:
with DAG(
    'transfer-data',
    default_args=default_args,
    description="Transfer data",
    start_date=(your start date),
    schedule_interval='10 2,5,8,11,14,17,20,23 * * *',
    catchup=True
) as dag:
    ...
Airflow is very particular about DAG definitions, because a misplaced setting can sometimes cause unexpected behavior in the backend metadata database. start_date is a parameter set at the DAG level: you're stating when the DAG should begin. You're not passing it to each individual task, which is what default_args is for.
It's hard to tell just by looking at what you've given us, but my guess is that the start date gets reset around midnight, and that's why it's somehow working for every run other than the 23:10 one.
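One way to see how that guess could play out is to model the schedule with plain datetime arithmetic. The sketch below is a simplified stand-in for the scheduler, not Airflow's actual logic. dynamic_start_date and is_scheduled are hypothetical helper names, and the model assumes that days_ago(0) is re-evaluated to the current day's midnight each time the DAG file is parsed and that a run is created only after its 3-hour interval has ended.

```python
from datetime import datetime, timedelta, timezone

HOURS = (2, 5, 8, 11, 14, 17, 20, 23)  # hours taken from the cron expression


def dynamic_start_date(now):
    """Mimic days_ago(0): midnight UTC of the day on which the file is parsed."""
    return now.replace(hour=0, minute=0, second=0, microsecond=0)


def is_scheduled(logical_date, now):
    """Assumed rule: the interval must have ended and the logical date
    must not be earlier than the (moving) start date."""
    interval_end = logical_date + timedelta(hours=3)
    return interval_end <= now and logical_date >= dynamic_start_date(now)


day = datetime(2022, 1, 1, tzinfo=timezone.utc)
for hour in HOURS:
    logical_date = day.replace(hour=hour, minute=10)
    due = logical_date + timedelta(hours=3)  # earliest moment the run can be created
    print(logical_date.strftime("%H:%M"), is_scheduled(logical_date, now=due))
```

Under these assumptions only the run whose interval starts at 23:10 is rejected, because by the time it becomes due at 02:10 the start date has already moved to the new day. A fixed start_date, such as a literal datetime, does not move in this way.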

Related

How to configure Apache Airflow start_date and schedule_interval to run daily at 7am?

I'm using Airflow 2.3.3 (through GCP Composer).
I pass this yaml configuration when deploying my DAG:
dag_args:
dag_id: FTP_DAILY
default_args:
owner: 'Dev team'
start_date: "00:00:00"
max_active_runs: 1
retries: 2
schedule_interval: "0 7 * * *"
ftp_conn_id: 'ftp_dev'
I want this DAG to run at 7am UTC every morning, but it's not running. In the UI it says next run: 2022-11-22, 07:00:00 (as of Nov 22nd) and it never runs. How should I configure my start_date and schedule_interval so that the DAG runs at 7am UTC every day, starting from the nearest 7am after the deployment?
You can pass the default args directly in the Python DAG code and compute yesterday's date, for example:
from datetime import timedelta
from airflow.utils.dates import days_ago
dag_default_args = {
    'depends_on_past': False,
    'email': ['airflow@example.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 0,
    'retry_delay': timedelta(minutes=5),
    'start_date': days_ago(1)
}
Then in the DAG:
with airflow.DAG(
        "dag_name",
        default_args=dag_default_args,
        schedule_interval="0 7 * * *") as dag:
    ...
In this case the schedule_interval cron expression will work correctly: Airflow bases the cron schedule on the start date.
The main concept in Airflow is that a DAG run starts only after its scheduled interval has passed. If you schedule a DAG with the setup above, Airflow will compute
interval_start_date as 2022-11-22 07:00:00
and interval_end_date as 2022-11-23 07:00:00.
Because you are asking Airflow to process data from this interval, it waits for the interval to pass and therefore starts execution on 23 November at 7am.
If you want it to trigger right after you deploy the DAG, move the start date back by one day. You might also need to set the catchup flag to True.
import pendulum
from airflow import DAG
with DAG(
    dag_id='new_workflow4',
    schedule_interval="0 7 * * *",
    start_date=pendulum.datetime(2022, 11, 21, hour=0, tz="UTC"),
    catchup=True
) as dag:
    ...
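The interval arithmetic described in this answer can be sketched in plain Python. This is a simplified model of a daily cron such as 0 7 * * *, not Airflow's own timetable code, and first_daily_run is a hypothetical helper name.

```python
from datetime import datetime, timedelta, timezone


def first_daily_run(start_date, hour=7):
    """Return (interval_start, interval_end) for the first run of a daily schedule.

    Assumed model: the first interval starts at the first `hour`:00 at or
    after start_date, and the run is triggered when that interval ends.
    """
    interval_start = start_date.replace(hour=hour, minute=0, second=0, microsecond=0)
    if interval_start < start_date:
        interval_start += timedelta(days=1)
    return interval_start, interval_start + timedelta(days=1)


for start in (datetime(2022, 11, 22, tzinfo=timezone.utc),
              datetime(2022, 11, 21, tzinfo=timezone.utc)):
    interval_start, interval_end = first_daily_run(start)
    print(f"start_date={start:%Y-%m-%d} -> first run triggers at {interval_end:%Y-%m-%d %H:%M}")
```

With a start date of 22 November the first trigger lands on 23 November at 07:00; moving the start date back to 21 November brings the first trigger forward to 22 November at 07:00.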

Execution of next task in case of failure- airflow

1.4 with composer 2.0.
I have a DAG that runs multiple tasks. The problem is that when one task fails, the next one runs anyway.
According to the Airflow documentation this should not happen; the DAG run should stop after a task fails.
The tasks are dependent, so if one fails, the next will fail too.
I want the DAG run to be terminated as soon as a task fails.
default_args = {
    'owner': owner,
    'start_date': datetime.datetime(2021, 12, 28, 15, 0, 0),  # 2021-12-28 15:00:00
    'email': email,
    'email_on_failure': True,
    'retries': 0,  # No retries: the task fails on the first error.
    # 'on_failure_callback': incident_pg,  # runs this function if the task fails
}
with DAG(dag_id=inst_dag_id,
         default_args=default_args,
         catchup=True,
         max_active_runs=5,
         # schedule_interval=None) as dag:  # manual execution
         schedule_interval="0 15 * * *") as dag:
It looks like the issue is in the task definitions; it would help to include the task code in your question. From what you have posted, no trigger_rule parameter is defined, and according to the BaseOperator definition in apache-airflow, trigger_rule defaults to all_success, which means all upstream tasks must succeed before a downstream task can run.
Check whether the task delete_bq_table has its trigger rule set to all_done; if so, remove it or change it to all_success.
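To make the difference between the two rules concrete, here is a small model written in plain Python. It is only an illustration of the semantics described above, not Airflow code; downstream_may_run and FINISHED_STATES are hypothetical names, and only the two rules mentioned in this answer are covered.

```python
# Task states that count as "finished" in this simplified model.
FINISHED_STATES = {"success", "failed", "skipped", "upstream_failed"}


def downstream_may_run(trigger_rule, upstream_states):
    """Decide whether a task may start, given the states of its upstream tasks."""
    if trigger_rule == "all_success":
        # Every upstream task must have succeeded.
        return all(state == "success" for state in upstream_states)
    if trigger_rule == "all_done":
        # Upstream tasks only need to have finished, whatever the outcome.
        return all(state in FINISHED_STATES for state in upstream_states)
    raise ValueError(f"trigger rule not covered by this sketch: {trigger_rule}")


upstream = ["success", "failed"]
print(downstream_may_run("all_success", upstream))  # False
print(downstream_may_run("all_done", upstream))     # True
```

A task that keeps running after an upstream failure therefore behaves the way all_done would in this model, which is why the trigger rule is the first thing to check.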

Dynamic dags not getting added by scheduler

I am trying to create dynamic DAGs and register them with the scheduler. I tried the example from https://www.astronomer.io/guides/dynamically-generating-dags/, which works well. I changed it a bit, as in the code below, and need help debugging the issue.
I tried:
1. Test-running the file. The DAG gets executed and globals() prints all the DAG objects, but they are not listed by list_dags or in the UI.
from datetime import datetime, timedelta
import requests
import json
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from airflow.operators.http_operator import SimpleHttpOperator
def create_dag(dag_id,
               dag_number,
               default_args):
    def hello_world_py(*args):
        print('Hello World')
        print('This is DAG: {}'.format(str(dag_number)))
    dag = DAG(dag_id,
              schedule_interval="@hourly",
              default_args=default_args)
    with dag:
        t1 = PythonOperator(
            task_id='hello_world',
            python_callable=hello_world_py,
            dag_number=dag_number)
    return dag
def fetch_new_dags(**kwargs):
    for n in range(1, 10):
        print("=====================START=========\n")
        dag_id = "abcd_" + str(n)
        print(dag_id)
        print("\n")
        globals()[dag_id] = create_dag(dag_id, n, default_args)
        print(globals())
default_args = {
    'owner': 'diablo_admin',
    'depends_on_past': False,
    'start_date': datetime(2019, 8, 8),
    'email': ['airflow@example.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=1),
    'trigger_rule': 'none_skipped'
    # 'schedule_interval': '0 * * * *'
    # 'queue': 'bash_queue',
    # 'pool': 'backfill',
    # 'priority_weight': 10,
    # 'end_date': datetime(2016, 1, 1),
}
dag = DAG('testDynDags', default_args=default_args, schedule_interval='*/1 * * * *')
# schedule_interval='*/1 * * * *'
check_for_dags = PythonOperator(dag=dag,
                                task_id='tst_dyn_dag',
                                provide_context=True,
                                python_callable=fetch_new_dags
                                )
check_for_dags
I expected 10 DAGs to be created dynamically and added to the scheduler.
I guess doing the following would fix it:
1. Completely remove the global testDynDags DAG and the tst_dyn_dag task (instantiation and invocation).
2. Invoke your fetch_new_dags(..) method with the requisite arguments in global scope.
Explanation
Dynamic dags / tasks merely means that you have a well-defined logic at the time of writing dag-definition file that can help create tasks / dags having a known structure in a pre-defined fashion.
You can NOT determine the structure of your DAG at runtime (task execution). So, for instance, you cannot add n identical tasks to your DAG if the upstream task returned an integer value n. But you can iterate over a YAML file containing n segments and generate n tasks / dags.
So clearly, wrapping dag-generation code inside an Airflow task itself makes no sense.
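A minimal sketch of that idea: build the DAG objects while the file is being parsed and bind them to module-level names. StubDag is a hypothetical stand-in for airflow.DAG so that the snippet runs without Airflow installed, build_dags is likewise an illustrative name, and the nine ids mirror the range(1, 10) loop in the question.

```python
from dataclasses import dataclass


@dataclass
class StubDag:
    """Hypothetical stand-in for airflow.DAG, used only for illustration."""
    dag_id: str
    schedule_interval: str = "@hourly"


def build_dags(count):
    """Create one DAG-like object per id, keyed by dag_id."""
    return {f"abcd_{n}": StubDag(dag_id=f"abcd_{n}") for n in range(1, count + 1)}


# This runs in global scope when the file is imported, not inside a task,
# so the objects already exist as module-level names after parsing.
globals().update(build_dags(9))

print(sorted(name for name in globals() if name.startswith("abcd_")))
```

With real Airflow objects the same pattern applies: the DAG objects have to be present in the file's global namespace at parse time for the scheduler to pick them up.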
UPDATE-1
From what is indicated in the comments, I infer that the requirement is to periodically revise the external source that feeds inputs (how many DAGs or tasks to create) to your DAG / task-generation script. While this is indeed a complex use case, a simple way to achieve it is to create 2 separate DAGs:
1. One DAG runs once in a while and generates the inputs, which are stored in an external resource such as an Airflow Variable (or any other external store like a file / S3 / database etc.).
2. The second DAG is constructed programmatically by reading that same data source, which was written by the first DAG.
You can take inspiration from the Adding DAGs based on Variable value section

Airflow schedule getting skipped if previous task execution takes more time

I have two tasks in my Airflow DAG. One triggers an API call (HTTP operator) and the other keeps checking its status using another API (HTTP sensor). This DAG is scheduled to run at 10 minutes past every hour. But sometimes one execution can take a long time to finish, for example 20 hours. In such cases, none of the schedules that fall while the previous run is still executing are run.
For example, if the job at 01:10 takes 10 hours to finish, the schedules at 02:10, 03:10, 04:10, ... 11:10 that are supposed to run get skipped, and only the one at 12:10 is executed.
I am using the local executor. I am running the Airflow webserver and scheduler using the scripts below.
start_server.sh
export AIRFLOW_HOME=./airflow_home;
export AIRFLOW_GPL_UNIDECODE=yes;
export AIRFLOW_CONN_REST_API=http://localhost:5000;
export AIRFLOW_CONN_MANAGEMENT_API=http://localhost:8001;
airflow initdb;
airflow webserver -p 7200;
start_scheduler.sh
export AIRFLOW_HOME=./airflow_home;
# Connection string for connecting to REST interface server
export AIRFLOW_CONN_REST_API=http://localhost:5000;
export AIRFLOW_CONN_MANAGEMENT_API=http://localhost:8001;
#export AIRFLOW__SMTP__SMTP_PASSWORD=**********;
airflow scheduler;
my_dag_file.py
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': airflow.utils.dates.days_ago(2),
    'email': admin_email_ids,
    'email_on_failure': False,
    'email_on_retry': False
}
DAG_ID = 'reconciliation_job_pipeline'
MANAGEMENT_RES_API_CONNECTION_CONFIG = 'management_api'
DA_REST_API_CONNECTION_CONFIG = 'rest_api'
recon_schedule = Variable.get('recon_cron_expression', "10 * * * *")
dag = DAG(DAG_ID, max_active_runs=1, default_args=default_args,
          schedule_interval=recon_schedule,
          catchup=False)
dag.doc_md = __doc__
spark_job_end_point = conf['sip_da']['spark_job_end_point']
fetch_index_record_count_config_key = conf['reconciliation'][
    'fetch_index_record_count']
fetch_index_record_count = SparkJobOperator(
    job_id_key='fetch_index_record_count_job',
    config_key=fetch_index_record_count_config_key,
    exec_id_req=False,
    dag=dag,
    http_conn_id=DA_REST_API_CONNECTION_CONFIG,
    task_id='fetch_index_record_count_job',
    data={},
    method='POST',
    endpoint=spark_job_end_point,
    headers={
        "Content-Type": "application/json"}
)
job_endpoint = conf['sip_da']['job_resource_endpoint']
fetch_index_record_count_status_job = JobStatusSensor(
    job_id_key='fetch_index_record_count_job',
    http_conn_id=DA_REST_API_CONNECTION_CONFIG,
    task_id='fetch_index_record_count_status_job',
    endpoint=job_endpoint,
    method='GET',
    request_params={'required': 'status'},
    headers={"Content-Type": "application/json"},
    dag=dag,
    poke_interval=15
)
fetch_index_record_count >> fetch_index_record_count_status_job
SparkJobOperator & JobStatusSensor are my custom classes extending SimpleHttpOperator & HttpSensor.
If I set depends_on_past to True, will it work as expected? Another problem with this option is that sometimes the status check job will fail, but the next schedule should still get triggered. How can I achieve this behavior?
I think the main point here is that you have set catchup=False; more detail can be found here. With that setting the Airflow scheduler skips the runs that were missed, which gives the behavior you described.
It sounds like you need to catch up when the previous run takes longer than expected. You can try changing it to catchup=True.
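The effect of the flag can be sketched with a small model in plain Python. This is a simplification for illustration, not the scheduler's real implementation; pending_runs is a hypothetical helper, the dates are made up, and the model assumes an hourly schedule in which an interval becomes eligible once it has fully elapsed.

```python
from datetime import datetime, timedelta


def pending_runs(last_logical_date, now, catchup, interval=timedelta(hours=1)):
    """List the logical dates for which runs would still be created.

    Assumed model: with catchup=True every missed interval is kept;
    with catchup=False only the most recent one is.
    """
    missed = []
    logical_date = last_logical_date + interval
    while logical_date + interval <= now:
        missed.append(logical_date)
        logical_date += interval
    return missed if catchup else missed[-1:]


last = datetime(2019, 1, 1, 1, 10)   # logical date of the long-running job
now = datetime(2019, 1, 1, 12, 10)   # hypothetical moment the slot is free again
print(len(pending_runs(last, now, catchup=True)))   # 10
print(pending_runs(last, now, catchup=False))
```

With catchup=True the ten missed intervals are all run, one after another because max_active_runs=1 in the question; with catchup=False only the latest interval survives.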

How to define a timeout for Apache Airflow DAGs?

I'm using Airflow 1.10.2 but Airflow seems to ignore the timeout I've set for the DAG.
I'm setting a timeout period for the DAG using the dagrun_timeout parameter (e.g. 20 seconds) and I've got a task which takes 2 mins to run, but Airflow marks the DAG as successful!
args = {
    'owner': 'me',
    'start_date': airflow.utils.dates.days_ago(2),
    'provide_context': True,
}
dag = DAG(
    'test_timeout',
    schedule_interval=None,
    default_args=args,
    dagrun_timeout=timedelta(seconds=20),
)
def this_passes(**kwargs):
    return
def this_passes_with_delay(**kwargs):
    time.sleep(120)
    return
would_succeed = PythonOperator(
    task_id='would_succeed',
    dag=dag,
    python_callable=this_passes,
    email=to,
)
would_succeed_with_delay = PythonOperator(
    task_id='would_succeed_with_delay',
    dag=dag,
    python_callable=this_passes_with_delay,
    email=to,
)
would_succeed >> would_succeed_with_delay
No error messages are thrown. Am I using an incorrect parameter?
As stated in the source code:
:param dagrun_timeout: specify how long a DagRun should be up before
timing out / failing, so that new DagRuns can be created. The timeout
is only enforced for scheduled DagRuns, and only once the
# of active DagRuns == max_active_runs.
so this might be expected behavior as you set schedule_interval=None. Here, the idea is rather to make sure a scheduled DAG won't last forever and block subsequent run instances.
Now, you may be interested in the execution_timeout available in all operators.
For example, you could set a 60s timeout on your PythonOperator like this:
would_succeed_with_delay = PythonOperator(task_id='would_succeed_with_delay',
                                          dag=dag,
                                          execution_timeout=timedelta(seconds=60),
                                          python_callable=this_passes_with_delay,
                                          email=to)
