Airflow long-running hourly DAGs missing a few hours - airflow

My DAG is scheduled to run every hour. It pulls an hour of data from an S3 source and processes it. Sometimes the task takes more than an hour to complete, and when that happens I miss an hour of data.
Example:
The 1:00 pm DAG run starts and runs for 2 hours, so the next run takes 3 (3 pm) as its parameter and the 2 pm data is skipped. In other words, how do I make sure the task runs every hour, i.e. 24 times a day?

Here is my DAG
import arrow
from datetime import timedelta

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

HOUR_PACIFIC = arrow.utcnow().shift(hours=-3).to('US/Pacific').format("HH")

dag = DAG(
    DAG_ID,                     # defined elsewhere in my project
    catchup=False,
    default_args=DEFAULT_ARGS,  # defined elsewhere in my project
    dagrun_timeout=timedelta(hours=5),
    schedule_interval='0 * * * *')

start = DummyOperator(
    task_id='Start',
    dag=dag)

my_task = EMRStep(emr,          # EMRStep is our custom operator, emr defined elsewhere
    'stg',
    HOUR_PACIFIC)

end = DummyOperator(
    task_id='End',
    dag=dag)

start >> my_task >> end

You need to pass catchup=True to the DAG object.
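For reference, a minimal, untested sketch of the same DAG object with catchup enabled (DAG_ID and DEFAULT_ARGS are assumed to be defined as in your code). With catchup on, Airflow creates the runs that were skipped while a long run was still in progress, and each backfilled run can read its own hour from the templated execution date rather than from the wall clock:

# Untested sketch: same DAG as above, with catchup=True so missed hourly
# intervals are backfilled once a long-running run finishes.
# DAG_ID and DEFAULT_ARGS are assumed to exist, as in the question.
from datetime import timedelta

from airflow import DAG

dag = DAG(
    DAG_ID,
    catchup=True,                       # backfill the hours that were missed
    max_active_runs=1,                  # optional: work through them one at a time
    default_args=DEFAULT_ARGS,
    dagrun_timeout=timedelta(hours=5),
    schedule_interval='0 * * * *')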

This appears to be a perfect scenario for using TimeDeltaSensor
Note: following code-snippet is just for reference and has NOT been tested
import datetime

from airflow.models import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import PythonOperator
from airflow.sensors.time_delta_sensor import TimeDeltaSensor
from airflow.utils.trigger_rule import TriggerRule

# create DAG object (runs once a day at midnight; Airflow uses 5-field cron)
my_dag: DAG = DAG(dag_id="my_dag",
                  start_date=datetime.datetime(year=2019, month=3, day=11),
                  schedule_interval="0 0 * * *")

# create dummy begin & end tasks
my_begin_task: DummyOperator = DummyOperator(dag=my_dag,
                                             task_id="my_begin_task")
my_end_task: DummyOperator = DummyOperator(dag=my_dag,
                                           task_id="my_end_task",
                                           trigger_rule=TriggerRule.ALL_DONE)

# populate the DAG
for i in range(1, 24, 1):
    # create sensors and actual tasks for all hours of the day
    my_time_delta_sensor: TimeDeltaSensor = TimeDeltaSensor(dag=my_dag,
                                                            task_id=f"my_time_delta_sensor_task_{i}_hours",
                                                            delta=datetime.timedelta(hours=i))
    my_actual_task: PythonOperator = PythonOperator(dag=my_dag,
                                                    task_id=f"my_actual_task_{i}_hours",
                                                    python_callable=my_callable
                                                    ..)
    # wire-up tasks together
    my_begin_task >> my_time_delta_sensor >> my_actual_task >> my_end_task
References
Apache Airflow: Delay a task for some period of time
Apache Airflow API Reference: TimeDeltaSensor
Cron Expression (Quartz) for a program to run every midnight at 12 am

Related

How to print Airflow time?

Need this info in the log as a print statement.
Assuming you need to get the duration of a DAG from a task inside the DAG itself, you need to put that task last, and accept that there will be a small difference (because the duration task is itself part of the DAG).
Here is an example of a simple DAG in which the last task calculates the duration and puts it in XCom.
There is also a slight difference between the XCom value and the Airflow UI because of number rounding.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.decorators import task
from airflow.operators.python import get_current_context
from airflow.sensors.time_delta import TimeDeltaSensor
from airflow.utils import timezone

with DAG(
    dag_id="test_dag",
    start_date=datetime(2022, 1, 1),
    schedule_interval=None,
    render_template_as_native_obj=True,
    tags=["test"],
) as dag:

    @task
    def task1():
        print("task1")

    sleep_task = TimeDeltaSensor(
        task_id="sleep",
        delta=timedelta(seconds=3),
        mode='reschedule'
    )

    @task(multiple_outputs=True)
    def duration_task():
        context = get_current_context()
        dag_run = context["dag_run"]
        execution_date = dag_run.execution_date
        now = timezone.make_aware(datetime.utcnow())
        duration = now - execution_date
        return {
            "duration": str(duration),
            "start_time": str(dag_run.execution_date),
            "end_time": str(now)
        }

    (task1() >> sleep_task >> duration_task())

Issues importing the ExternalTaskSensor module (airflow.operators.sensors.external_task) and triggering an external DAG

I am trying to trigger multiple external Dataflow DAG jobs from a master DAG.
I plan to use TriggerDagRunOperator and ExternalTaskSensor. I have around 10 Dataflow jobs - some are to be executed in sequence and some in parallel.
For example: I want to execute DAGs A, B, C, etc. from the master DAG, and before execution moves on to the next task I want to ensure the previous DAG run has completed. But I am having issues importing the ExternalTaskSensor module.
Is there any alternative way to achieve this?
Note: each DAG (e.g. A/B/C) has 6-7 tasks. Can ExternalTaskSensor check that the last task of DAG A has completed before DAG B or C starts?
I used the sample code below to run DAGs that use ExternalTaskSensor, and I was able to successfully import the ExternalTaskSensor module.
import time
from datetime import datetime, timedelta
from pprint import pprint

from airflow import DAG
from airflow.operators.dagrun_operator import TriggerDagRunOperator
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import PythonOperator
from airflow.sensors.external_task_sensor import ExternalTaskSensor
from airflow.utils.state import State

sensors_dag = DAG(
    "test_launch_sensors",
    schedule_interval=None,
    start_date=datetime(2020, 2, 14, 0, 0, 0),
    dagrun_timeout=timedelta(minutes=150),
    tags=["DEMO"],
)

dummy_dag = DAG(
    "test_dummy_dag",
    schedule_interval=None,
    start_date=datetime(2020, 2, 14, 0, 0, 0),
    dagrun_timeout=timedelta(minutes=150),
    tags=["DEMO"],
)

def print_context(ds, **context):
    pprint(context['conf'])

with dummy_dag:
    starts = DummyOperator(task_id="starts", dag=dummy_dag)
    empty = PythonOperator(
        task_id="empty",
        provide_context=True,
        python_callable=print_context,
        dag=dummy_dag,
    )
    ends = DummyOperator(task_id="ends", dag=dummy_dag)
    starts >> empty >> ends

with sensors_dag:
    trigger = TriggerDagRunOperator(
        task_id=f"trigger_{dummy_dag.dag_id}",
        trigger_dag_id=dummy_dag.dag_id,
        conf={"key": "value"},
        execution_date="{{ execution_date }}",
    )
    sensor = ExternalTaskSensor(
        task_id="wait_for_dag",
        external_dag_id=dummy_dag.dag_id,
        external_task_id="ends",
        poke_interval=5,
        timeout=120,
    )
    trigger >> sensor
In the above sample code, sensors_dag triggers tasks in dummy_dag using the TriggerDagRunOperator(). The sensors_dag then waits, via the ExternalTaskSensor, until the specified external_task in dummy_dag has completed.
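If the ten Dataflow DAGs have to run strictly one after another, the same trigger-plus-sensor pair can simply be repeated per DAG. The sketch below is untested and reuses the sensors_dag object and imports from the snippet above; the DAG ids dag_a/dag_b/dag_c and the final task name "ends" are placeholders:

# Untested sketch: chain one TriggerDagRunOperator + ExternalTaskSensor pair
# per external DAG, so each DAG must finish before the next one is triggered.
# "dag_a", "dag_b", "dag_c" and the task name "ends" are placeholders.
with sensors_dag:
    previous = None
    for external_dag_id in ["dag_a", "dag_b", "dag_c"]:
        trigger = TriggerDagRunOperator(
            task_id=f"trigger_{external_dag_id}",
            trigger_dag_id=external_dag_id,
            execution_date="{{ execution_date }}",
        )
        sensor = ExternalTaskSensor(
            task_id=f"wait_for_{external_dag_id}",
            external_dag_id=external_dag_id,
            external_task_id="ends",   # assumed last task of each external DAG
            poke_interval=30,
            timeout=3600,
        )
        trigger >> sensor
        if previous is not None:
            previous >> trigger        # enforce the sequential order
        previous = sensor

DAGs that are allowed to run in parallel would simply not get the previous >> trigger dependency.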

How can I run all necessary DAGs / Do I need ExternalTaskSensor?

Is it possible to run two DAGs at different times with the ExternalTaskSensor?
I have two DAGs.
DAG A runs every two hours:
10 a.m. (successful)
12 p.m. (failed)
2 p.m. (successful)
DAG B depends on DAG A. DAG B waits for DAG A at 12 p.m. and fails, because DAG A failed. But since DAG A was successful at 2 p.m., DAG B should then be able to run.
How can you implement this? With an ExternalTaskSensor?
I just have a small dummy example to try to understand it.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.sensors import ExternalTaskSensor

source_dag = DAG(
    dag_id='sensor_dag_source',
    start_date=datetime(2020, 1, 20),
    schedule_interval='* * * * *'
)

first_task = DummyOperator(task_id='first_task', dag=source_dag)

target_dag = DAG(
    dag_id='sensor_dag_target',
    start_date=datetime(2020, 1, 20),
    schedule_interval='* * * * *'
)

task_sensor = ExternalTaskSensor(
    dag=target_dag,
    task_id='dag_sensor_source_sensor',
    retries=100,
    retry_delay=timedelta(seconds=30),
    mode='reschedule',
    external_dag_id='sensor_dag_source',
    external_task_id='first_task'
)

first_task = DummyOperator(task_id='first_task', dag=target_dag)

task_sensor >> first_task
You can try using TriggerDagRunOperator and trigger DAG B from DAG A.
Here is a full answer:
In airflow, is there a good way to call another dag's task?
There is another good post about it:
Wiring top-level DAGs together
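A rough, untested sketch of that suggestion applied to the dummy DAGs above: DAG A ends with a TriggerDagRunOperator, and DAG B is defined with schedule_interval=None so it only runs when DAG A finishes successfully (the import paths follow the older Airflow style used in the question):

# Untested sketch: instead of DAG B sensing DAG A, DAG A triggers DAG B as
# its last step. Because the trigger task only runs when the upstream task
# succeeded, DAG B only runs after successful DAG A runs.
from datetime import datetime

from airflow import DAG
from airflow.operators.dagrun_operator import TriggerDagRunOperator
from airflow.operators.dummy_operator import DummyOperator

source_dag = DAG(
    dag_id='sensor_dag_source',
    start_date=datetime(2020, 1, 20),
    schedule_interval='0 */2 * * *'      # DAG A runs every two hours
)

first_task = DummyOperator(task_id='first_task', dag=source_dag)

trigger_target = TriggerDagRunOperator(
    task_id='trigger_sensor_dag_target',
    trigger_dag_id='sensor_dag_target',  # DAG B, defined with schedule_interval=None
    dag=source_dag
)

first_task >> trigger_target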

Airflow - TriggerDagRunOperator Cross Check

I am trying to trigger one DAG from another, using TriggerDagRunOperator.
I have the following two DAGs.
Dag 1:
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import PythonOperator
from airflow.operators.dagrun_operator import TriggerDagRunOperator

def print_hello():
    return 'Hello world!'

dag = DAG('dag_one', description='Simple tutorial DAG',
          schedule_interval='0/15 * * * *',
          start_date=datetime(2017, 3, 20), catchup=False)

dummy_operator = DummyOperator(task_id='dummy_task', retries=3, dag=dag)
hello_operator = PythonOperator(task_id='hello_task', python_callable=print_hello, dag=dag)

trigger = TriggerDagRunOperator(
    task_id="test_trigger_dagrun",
    trigger_dag_id="dag_two",  # Ensure this equals the dag_id of the DAG to trigger
    dag=dag,
)

dummy_operator >> hello_operator >> trigger
Dag 2:
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import PythonOperator

def print_hello():
    return 'Hello XYZABC!'

dag = DAG('dag_two', description='Simple tutorial DAG',
          schedule_interval='0 12 * * *',
          start_date=datetime(2017, 3, 20), catchup=False)

dummy_operator = DummyOperator(task_id='dummy_task', retries=3, dag=dag)
hello_operator = PythonOperator(task_id='hello_task', python_callable=print_hello, dag=dag)

dummy_operator >> hello_operator
Going through the webserver, everything seems fine and running (i.e. DAG one is triggering DAG two).
My question is how to make sure, or check, that DAG 2 was actually triggered by DAG 1 and not by its own schedule or some other manual action.
Basically, where can I find who triggered the DAG, or how the DAG was triggered?
If you look at the Tree View of Dag 1, the Dag 2 run that was started by Dag 1 shows up as a task in that view.
If you look at the Tree View of Dag 2 and open View Log, you will find AIRFLOW_CTX_DAG_RUN_ID=trig__YYYY_MM_DD...
If the run was scheduled, it will instead say
AIRFLOW_CTX_DAG_RUN_ID=scheduled__YYYY_MM_DDT...
You can also compare the start time of the Dag 2 run with the time of the trigger task in Dag 1.
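If you want the same check from inside Dag 2 itself, an untested sketch like the one below could log it; the task name is only illustrative. Both the run_id prefix and the dag_run.external_trigger flag record how the run was created:

# Illustrative, untested sketch for Dag 2: log how the current run was created.
# Runs created by TriggerDagRunOperator carry a "trig__..." run_id and have
# external_trigger=True; scheduler-created runs use "scheduled__..." and False.
def who_triggered_me(**context):
    dag_run = context['dag_run']
    print("run_id:", dag_run.run_id)
    print("external_trigger:", dag_run.external_trigger)

check_trigger = PythonOperator(
    task_id='check_trigger',
    python_callable=who_triggered_me,
    provide_context=True,   # needed on older Airflow versions
    dag=dag,                # the dag object of Dag 2
)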

Run only the latest Airflow DAG

Let's say I would like to run a pretty simple ETL DAG with Airflow:
it checks the last insert time in DB2, and it loads any newer rows from DB1 to DB2.
There are some understandable requirements:
It is scheduled hourly, but the first few runs will last more than 1 hour:
e.g. the first run should process a month of data and lasts 72 hours,
so the second run should process the last 72 hours and lasts 7.2 hours,
the third processes 7.2 hours and finishes within an hour,
and from then on it runs hourly.
While the DAG is running, don't start the next one; skip it instead.
If the time passed the trigger event and the DAG didn't start, don't start it later.
There are other DAGs as well; the DAGs should be executed independently.
I've found these parameters and this operator a little confusing. What are the distinctions between them?
depends_on_past
catchup
backfill
LatestOnlyOperator
Which one should I use, and which LocalExecutor?
PS: there's already a very similar thread, but it isn't exhaustive.
DAG max_active_runs = 1 combined with catchup = False would solve this.
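A minimal, untested sketch of that combination (the dag_id and the callable are placeholders):

# Untested sketch: at most one run at a time, and no backfilling of the
# intervals that were missed while the previous run was still going.
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator

dag = DAG(
    'my_hourly_etl',                  # placeholder dag_id
    start_date=datetime(2018, 2, 13),
    schedule_interval='0 * * * *',    # hourly
    max_active_runs=1,                # never start a new run while one is active
    catchup=False)                    # don't backfill the skipped intervals

def my_etl_callable(**kwargs):
    # placeholder for the real "check DB2, copy newer rows from DB1" logic
    pass

etl_task = PythonOperator(task_id='etl_task', python_callable=my_etl_callable,
                          provide_context=True, dag=dag)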
This one satisfies my requirements. The DAG runs every minute, and my "main" task lasts for 90 seconds, so it should skip every second run.
I've used a ShortCircuitOperator to check whether the current run is the only active one at the moment (a query on the dag_run table of the airflow db), and catchup=False to disable backfilling.
However, I could not get the LatestOnlyOperator, which should do something similar, to work properly.
DAG file
import os
import sys
from datetime import datetime

import airflow
from airflow import DAG
from airflow.operators.python_operator import PythonOperator, ShortCircuitOperator

import foo
import util

default_args = {
    'owner': 'airflow',
    'depends_on_past': True,
    'start_date': datetime(2018, 2, 13),  # or any date in the past
    'email': ['services@mydomain.com'],
    'email_on_failure': True}

dag = DAG(
    'test90_dag',
    default_args=default_args,
    schedule_interval='* * * * *',
    catchup=False)

condition_task = ShortCircuitOperator(
    task_id='skip_check',
    python_callable=util.is_latest_active_dagrun,
    provide_context=True,
    dag=dag)

py_task = PythonOperator(
    task_id="test90_task",
    python_callable=foo.bar,
    provide_context=True,
    dag=dag)

airflow.utils.helpers.chain(condition_task, py_task)
util.py
import logging
from datetime import datetime

from airflow.hooks.postgres_hook import PostgresHook

def get_num_active_dagruns(dag_id, conn_id='airflow_db'):
    # requires an 'airflow_db' connection to the Airflow metadata database
    airflow_db = PostgresHook(postgres_conn_id=conn_id)
    conn = airflow_db.get_conn()
    cursor = conn.cursor()
    sql = "select count(*) from public.dag_run where dag_id = '{dag_id}' and state in ('running', 'queued', 'up_for_retry')".format(dag_id=dag_id)
    cursor.execute(sql)
    num_active_dagruns = cursor.fetchone()[0]
    return num_active_dagruns

def is_latest_active_dagrun(**kwargs):
    num_active_dagruns = get_num_active_dagruns(dag_id=kwargs['dag'].dag_id)
    return (num_active_dagruns == 1)
foo.py
import datetime
import time

def bar(*args, **kwargs):
    t = datetime.datetime.now()
    execution_date = str(kwargs['execution_date'])
    with open("/home/airflow/test.log", "a") as myfile:
        myfile.write(execution_date + ' - ' + str(t) + '\n')
    time.sleep(90)
    with open("/home/airflow/test.log", "a") as myfile:
        myfile.write(execution_date + ' - ' + str(t) + ' +90\n')
    return 'bar: ok'
Acknowledgement: this answer is based on this blog post.
DAG max_active_runs = 1 combined with catchup = False, plus a DUMMY task right at the beginning (a sort of START task) with wait_for_downstream=True.
As for LatestOnlyOperator - it will help to avoid rerunning a task if the previous execution is not yet finished.
Or create the "START" task as a LatestOnlyOperator and make sure all tasks in the first processing layer are connected to it. But pay attention - as per the docs: "Note that downstream tasks are never skipped if the given DAG_Run is marked as externally triggered."
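Untested sketches of both suggestions; dag is your DAG object and first_real_task is a placeholder for the first task of the processing layer:

# Variant 1: dummy START task with wait_for_downstream=True, so a run's start
# task waits for the tasks directly downstream of the previous run's start
# task to finish before it runs.
from airflow.operators.dummy_operator import DummyOperator

start = DummyOperator(task_id='start', wait_for_downstream=True, dag=dag)
start >> first_real_task   # first_real_task is a placeholder

# Variant 2: START task as LatestOnlyOperator, so downstream tasks are skipped
# for any run that is not the most recent scheduled one (note the caveat about
# externally triggered runs quoted above).
from airflow.operators.latest_only_operator import LatestOnlyOperator

latest_only = LatestOnlyOperator(task_id='latest_only_start', dag=dag)
latest_only >> first_real_task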
