Trigger DAG Run on environment startup/restart - airflow

Is there a way to trigger a DAG every time the Airflow environment is brought up? This would be helpful for running some tests on the environment.

You can monitor the boot time of the system or process, compare it with the last execution datetime of the DAG, and trigger the DAG accordingly.
You can refer to the following code:
import psutil
from datetime import datetime
from dateutil import parser
from airflow.models import Variable

# Time of the last system boot, as a datetime
last_reboot = psutil.boot_time()
boot_time = datetime.fromtimestamp(last_reboot)
Variable.set('boot_time', boot_time.strftime("%Y-%m-%d %H:%M:%S"))
# End time of the last run; an empty string means the DAG has never run
last_run_dt = Variable.get('last_run_end_date', "")
if last_run_dt == "":
    last_run_at = parser.parse('1970-01-01 00:00:00')
else:
    last_run_at = parser.parse(last_run_dt)
if boot_time > last_run_at:
    # Your code to trigger the Dag
    pass
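The comparison in the snippet above can be isolated into a small helper and tried without an Airflow installation. This is only a sketch: should_trigger is a hypothetical name, and it assumes the last run time is stored in the same "%Y-%m-%d %H:%M:%S" format that the snippet uses for the boot_time Variable.

```python
from datetime import datetime

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"  # assumed storage format, same as boot_time above


def should_trigger(boot_time: datetime, last_run_end: str) -> bool:
    """Return True when the machine booted after the last recorded run.

    An empty string means the DAG has never run, so any boot time qualifies.
    """
    if last_run_end == "":
        last_run_at = datetime(1970, 1, 1)
    else:
        last_run_at = datetime.strptime(last_run_end, TIME_FORMAT)
    return boot_time > last_run_at


if __name__ == "__main__":
    boot = datetime(2022, 11, 2, 8, 0, 0)
    print(should_trigger(boot, ""))                     # DAG never ran before
    print(should_trigger(boot, "2022-11-01 23:00:00"))  # last run ended before the reboot
    print(should_trigger(boot, "2022-11-02 09:00:00"))  # last run ended after the reboot
```

For the check to stay meaningful, the DAG itself would also have to write last_run_end_date when it finishes, which the snippet above leaves to the reader.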

Related

Airflow XCOMs communication from BashOperator to PythonOperator

I'm new to Apache Airflow and am trying to write my first DAG, which has a task that depends on another task (using ti.xcom_pull).
PS: I run Airflow in WSL Ubuntu 20.04 using VS Code.
I created a first task (task_id = "get_datetime") that runs the "date" bash command, and it works.
Then I created another task (task_id='process_datetime') that takes the datetime from the first task and processes it. I set the python_callable and everything seems fine.
The issue is that dt = ti.xcom_pull gives a NoneType when I run "airflow tasks test first_ariflow_dag process_datetime 2022-11-1" in the terminal, but when I look at the log in the Airflow UI, I find that it works normally.
Could someone give me a solution, please?
`
from datetime import datetime
from airflow.models import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator
def process_datetime(ti):
    dt = ti.xcom_pull(task_ids=['get_datetime'])
    if not dt:
        raise Exception('No datetime value')
    dt = str(dt[0]).split()
    return {
        'year': int(dt[-1]),
        'month': dt[1],
        'day': int(dt[2]),
        'time': dt[3],
        'day_of_week': dt[0]
    }
with DAG(
    dag_id='first_ariflow_dag',
    schedule_interval='* * * * *',
    start_date=datetime(year=2022, month=11, day=1),
    catchup=False
) as dag:
    # 1. Get the current datetime
    task_get_datetime = BashOperator(
        task_id='get_datetime',
        bash_command='date'
    )
    # 2. Process the datetime
    task_process_datetime = PythonOperator(
        task_id='process_datetime',
        python_callable=process_datetime
    )
`
I get this error:
[2022-11-02 00:51:45,420] {taskinstance.py:1851} ERROR - Task failed with exception
Traceback (most recent call last):
  File "/mnt/c/Users/Salim/Desktop/A-Learning/Airflow_Conda/airflow_env/lib/python3.8/site-packages/airflow/operators/python.py", line 175, in execute
    return_value = self.execute_callable()
  File "/mnt/c/Users/Salim/Desktop/A-Learning/Airflow_Conda/airflow_env/lib/python3.8/site-packages/airflow/operators/python.py", line 193, in execute_callable
    return self.python_callable(*self.op_args, **self.op_kwargs)
  File "/home/salim/airflow/dags/first_dag.py", line 12, in process_datetime
    raise Exception('No datetime value')
Exception: No datetime value
According to the documentation, to push data to XCom you need to set the parameter do_xcom_push (Airflow 2) or xcom_push (Airflow 1):
If BaseOperator.do_xcom_push is True, the last line written to stdout will also be pushed to an XCom when the bash command completes
The BashOperator should look like this:
task_get_datetime = BashOperator(
    task_id='get_datetime',
    bash_command='date',
    do_xcom_push=True
)
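For reference, the parsing done in process_datetime can be checked on its own, outside Airflow. The sketch below assumes that the date command prints a line such as "Wed Nov  2 00:51:45 UTC 2022" (the exact format depends on the locale) and that the pulled value is a sequence with one entry per requested task id, as the dt[0] in the question implies. parse_date_output is a hypothetical helper name.

```python
def parse_date_output(pulled):
    """Split the output of `date` into its parts.

    `pulled` mimics the value the question gets from
    ti.xcom_pull(task_ids=['get_datetime']): a sequence of pushed values.
    """
    if not pulled or pulled[0] is None:
        raise ValueError('No datetime value')
    parts = str(pulled[0]).split()
    return {
        'year': int(parts[-1]),
        'month': parts[1],
        'day': int(parts[2]),
        'time': parts[3],
        'day_of_week': parts[0],
    }


if __name__ == "__main__":
    # Assumed sample line; the real value is the last line `date` wrote to stdout.
    print(parse_date_output(['Wed Nov  2 00:51:45 UTC 2022']))
```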

airflow 2.2 timetable for schedule, always with error: timetable not registered

I followed this example:
Create the example timetable .py file and put it in $HOME/airflow/plugins.
Create the example DAG file and put it in $HOME/airflow/dags.
After restarting the scheduler and webserver, I get a DAG import error. In the web UI, the last line of the detailed error message is:
airflow.exceptions.SerializationError: Failed to serialize DAG 'example_timetable_dag2': Timetable class 'AfterWorkdayTimetable' is not registered
But if I run airflow plugins, I can see the timetable in the name and source list.
How can I fix this error?
Details of plugins/AfterWorkdayTimetable.py:
from datetime import timedelta
from typing import Optional
from pendulum import Date, DateTime, Time, timezone
from airflow.plugins_manager import AirflowPlugin
from airflow.timetables.base import DagRunInfo, DataInterval, TimeRestriction, Timetable
UTC = timezone("UTC")
class AfterWorkdayTimetable(Timetable):
    def infer_data_interval(self, run_after: DateTime) -> DataInterval:
        weekday = run_after.weekday()
        if weekday in (0, 6):  # Monday and Sunday -- interval is last Friday.
            days_since_friday = (run_after.weekday() - 4) % 7
            delta = timedelta(days=days_since_friday)
        else:  # Otherwise the interval is yesterday.
            delta = timedelta(days=1)
        start = DateTime.combine((run_after - delta).date(), Time.min).replace(tzinfo=UTC)
        return DataInterval(start=start, end=(start + timedelta(days=1)))
    def next_dagrun_info(
        self,
        *,
        last_automated_data_interval: Optional[DataInterval],
        restriction: TimeRestriction,
    ) -> Optional[DagRunInfo]:
        if last_automated_data_interval is not None:  # There was a previous run on the regular schedule.
            last_start = last_automated_data_interval.start
            last_start_weekday = last_start.weekday()
            if 0 <= last_start_weekday < 4:  # Last run on Monday through Thursday -- next is tomorrow.
                delta = timedelta(days=1)
            else:  # Last run on Friday -- skip to next Monday.
                delta = timedelta(days=(7 - last_start_weekday))
            next_start = DateTime.combine((last_start + delta).date(), Time.min).replace(tzinfo=UTC)
        else:  # This is the first ever run on the regular schedule.
            next_start = restriction.earliest
            if next_start is None:  # No start_date. Don't schedule.
                return None
            if not restriction.catchup:
                # If the DAG has catchup=False, today is the earliest to consider.
                next_start = max(next_start, DateTime.combine(Date.today(), Time.min).replace(tzinfo=UTC))
            elif next_start.time() != Time.min:
                # If earliest does not fall on midnight, skip to the next day.
                next_day = next_start.date() + timedelta(days=1)
                next_start = DateTime.combine(next_day, Time.min).replace(tzinfo=UTC)
            next_start_weekday = next_start.weekday()
            if next_start_weekday in (5, 6):  # If next start is in the weekend, go to next Monday.
                delta = timedelta(days=(7 - next_start_weekday))
                next_start = next_start + delta
        if restriction.latest is not None and next_start > restriction.latest:
            return None  # Over the DAG's scheduled end; don't schedule.
        return DagRunInfo.interval(start=next_start, end=(next_start + timedelta(days=1)))
class WorkdayTimetablePlugin(AirflowPlugin):
    name = "workday_timetable_plugin"
    timetables = [AfterWorkdayTimetable]
Details of dags/test_afterwork_timetable.py:
import datetime
from airflow import DAG
from AfterWorkdayTimetable import AfterWorkdayTimetable
from airflow.operators.dummy import DummyOperator
with DAG(
    dag_id="example_workday_timetable",
    start_date=datetime.datetime(2021, 1, 1),
    timetable=AfterWorkdayTimetable(),
    tags=["example", "timetable"],
) as dag:
    DummyOperator(task_id="run_this")
If I run airflow plugins:
name | source
==================================+==========================================
workday_timetable_plugin | $PLUGINS_FOLDER/AfterWorkdayTimetable.py
I had a similar issue.
Either you need to add an __init__.py file, or you should try the following to debug your issue.
Get all plugin manager objects:
from airflow import plugins_manager
plugins_manager.initialize_timetables_plugins()
plugins_manager.timetable_classes
I got this result: {'quarterly.QuarterlyTimetable': <class 'quarterly.QuarterlyTimetable'>}
Compare your result with the exception message. If the timetable_classes dictionary has a different plugin name, you should change the plugin file path.
You could also try this inside the DAG Python file:
from AfterWorkdayTimetable import AfterWorkdayTimetable
from airflow import plugins_manager
print(plugins_manager.as_importable_string(AfterWorkdayTimetable))
This would help you find the name that Airflow uses when searching the timetable_classes dictionary.
You need to register the timetable in the "timetables" list via the plugin interface. See:
https://airflow.apache.org/docs/apache-airflow/stable/plugins.html
I encountered the same issue.
These are the steps I followed:
Add the timetable file (custom_tt.py) to the plugins folder.
Make sure the plugins folder contains an __init__.py file.
Set lazy_load_plugins to False in airflow.cfg:
lazy_load_plugins = False
Add the import statement to the DAG file:
from custom_tt import CustomTimeTable
Use it in the DAG:
DAG(timetable=CustomTimeTable())
Restart the webserver and scheduler.
Problem fixed.
They have found a resolution to this, but it doesn't seem that the documentation has been updated to reflect the fix yet.
Your function
def infer_data_interval(self, run_after: DateTime) -> DataInterval:
should be
def infer_manual_data_interval(self, *, run_after: DateTime) -> DataInterval:
See reference:
Apache airflow.timetables.base Documentation
After updating the function with the correct name and extra parameter, everything else should work for you as it did for me.
I was running into this as well. airflow plugins reports that the plugin is registered, and running the DAG script on the command line works fine, but the web UI reports that the plugin is not registered. @Bakuchi's answer pointed me in the right direction.
In my case, the problem was how I was importing the Timetable - airflow apparently expects you to import it relative to the $PLUGINS_FOLDER, not from any other directory, even if that other directory is also on the PYTHONPATH.
For a concrete example:
export PYTHONPATH=/path/to/my/code:$PYTHONPATH
# airflow.cfg
plugins_folder = /path/to/my/code/airflow_plugins
# dag.py
import sys
from airflow_plugins.custom_timetable import CustomTimetable as Bad
from custom_timetable import CustomTimetable as Good
from airflow import plugins_manager
plugins_manager.initialize_timetables_plugins()
print(sys.path) # /path/to/my/code:...:/path/to/my/code/airflow_plugins
print(plugins_manager.as_importable_string(Bad)) # airflow_plugins.custom_timetable.CustomTimetable
print(plugins_manager.as_importable_string(Good)) # custom_timetable.CustomTimetable
print(plugins_manager.timetable_classes) # {'custom_timetable.CustomTimetable': <class 'custom_timetable.CustomTimetable'>}
A bad lookup in plugins_manager.timetable_classes is ultimately what ends up raising the _TimetableNotRegistered error, so the fix is to make the keys match by changing how the timetable is imported.
I submitted a bug report: https://github.com/apache/airflow/issues/21259
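The key mismatch described in this answer can be reproduced without Airflow. The sketch below is an illustration only: importable_string is a hypothetical stand-in that builds "module.ClassName" the way the printed values above suggest, and make_timetable_class creates dummy classes that merely pretend to come from different import paths.

```python
def importable_string(cls):
    """Hypothetical stand-in: build 'module.ClassName' from a class."""
    return f"{cls.__module__}.{cls.__qualname__}"


def make_timetable_class(module_name):
    """Create a dummy class that pretends to have been imported as `module_name`."""
    return type("CustomTimetable", (), {"__module__": module_name})


# What the plugin loader registered (imported relative to the plugins folder).
registered = make_timetable_class("custom_timetable")
timetable_classes = {importable_string(registered): registered}

# The same source file reached through two different import paths in the DAG file.
good = make_timetable_class("custom_timetable")
bad = make_timetable_class("airflow_plugins.custom_timetable")

good_found = importable_string(good) in timetable_classes
bad_found = importable_string(bad) in timetable_classes

if __name__ == "__main__":
    print(importable_string(good), good_found)  # key matches the registered one
    print(importable_string(bad), bad_found)    # key differs, so the lookup fails
```

The lookup is by string, so two imports of the same file under different module names count as different timetables.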

Generating airflow DAGs dynamically

I am trying to generate Airflow DAGs from a template in Python code, using globals() as described here to define the DAG object and save it. Below is my code:
import datetime as dt
import sys
import airflow
from airflow.models import DAG
from airflow.operators.bash_operator import BashOperator
argumentList = sys.argv
owner = argumentList[1]
dag_name = argumentList[2]
taskID = argumentList[3]
bashCommand = argumentList[4]
default_args = {
    'owner': owner,
    'start_date': dt.datetime(2019, 6, 1),
    'retries': 1,
    'retry_delay': dt.timedelta(minutes=5),
}
def dagCreate():
    with DAG(dag_name,
             default_args=default_args,
             schedule_interval=None,
             ) as dag:
        print_hello = BashOperator(task_id=taskID, bash_command=bashCommand)
    return dag
globals()[dag_name] = dagCreate()
I have kept this Python code outside the dag_folder and am executing it as follows:
python bash-dag-generator.py Airflow test_bash_generate auto_bash_task ls
But I don't see any DAG generated in the airflow webserver UI. I am not sure where I am going wrong.
As per the official documentation:
DAGs are defined in standard Python files that are placed in Airflow’s DAG_FOLDER. Airflow will execute the code in each file to dynamically build the DAG objects. You can have as many DAGs as you want, each describing an arbitrary number of tasks. In general, each one should correspond to a single logical workflow.
So unless your code is actually inside the DAG_FOLDER, it will not be registered as a DAG.
The way I have been able to implement dynamic DAGs is by using an Airflow Variable.
In the example below I have a CSV file that contains a list of Bash commands such as ls, echo, etc. As part of the read_file task I store the file location in the Airflow Variable. The part where we read the CSV file and loop through the commands is where the tasks are created dynamically.
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from airflow.operators.bash_operator import BashOperator
from airflow.models import Variable
from datetime import datetime, timedelta
import csv
'''
Orchestrate the Dynamic Tasks
'''
def read_file_task(**kwargs):
    # **kwargs accepts the context passed because of provide_context=True
    print('I am reading a File and setting variables ')
    Variable.set('dynamic-dag-sample', '/home/bashoperator.csv')
with DAG('dynamic-dag-sample',
         start_date=datetime(2018, 11, 1)) as dag:
    read_file_task = PythonOperator(task_id='read_file_task',
                                    python_callable=read_file_task, provide_context=True,
                                    dag=dag)
    # default_var=None avoids a KeyError before the Variable has been set
    dynamic_dag_sample_file_path = Variable.get("dynamic-dag-sample", default_var=None)
    if dynamic_dag_sample_file_path is not None:
        with open(dynamic_dag_sample_file_path) as csv_file:
            reader = csv.DictReader(csv_file)
            line_count = 0
            # One BashOperator per CSV row, each downstream of read_file_task
            for row in reader:
                bash_task = BashOperator(task_id=row['Taskname'], bash_command=row['Command'])
                read_file_task.set_downstream(bash_task)
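The answer does not show the CSV file itself. Assuming it has a header row with the Taskname and Command columns that the loop reads, the parsing step can be sketched without Airflow as follows. The sample rows and the load_task_specs name are made up for illustration.

```python
import csv
import io

# Assumed layout of /home/bashoperator.csv: a header row plus one row per task.
SAMPLE_CSV = """Taskname,Command
list_files,ls
say_hello,echo hello
"""


def load_task_specs(csv_file):
    """Return (task_id, bash_command) pairs from an open CSV file."""
    reader = csv.DictReader(csv_file)
    return [(row['Taskname'], row['Command']) for row in reader]


specs = load_task_specs(io.StringIO(SAMPLE_CSV))

if __name__ == "__main__":
    for task_id, command in specs:
        print(task_id, '->', command)
```

Each pair would correspond to one BashOperator in the loop above, so task ids in the file have to be unique within the DAG.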

Getting *** Task instance did not exist in the DB as error when running gcs_to_bq in composer

While executing the following python script using cloud-composer, I get *** Task instance did not exist in the DB under the gcs2bq task Log in Airflow
Code:
import datetime
import os
import csv
import pandas as pd
import pip
from airflow import models
#from airflow.contrib.operators import dataproc_operator
from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import PythonOperator
from airflow.utils import trigger_rule
from airflow.contrib.operators import gcs_to_bq
from airflow.contrib.operators import bigquery_operator
print('''/-------/--------/------/
-------/--------/------/''')
yesterday = datetime.datetime.combine(
    datetime.datetime.today() - datetime.timedelta(1),
    datetime.datetime.min.time())
default_dag_args = {
    # Setting start date as yesterday starts the DAG immediately when it is
    # detected in the Cloud Storage bucket.
    'start_date': yesterday,
    # To email on failure or retry set 'email' arg to your email and enable
    # emailing here.
    'email_on_failure': False,
    'email_on_retry': False,
    # If a task fails, retry it once after waiting at least 5 minutes
    'retries': 1,
    'retry_delay': datetime.timedelta(minutes=5),
    'project_id': 'data-rubrics'
    #models.Variable.get('gcp_project')
}
try:
    # [START composer_quickstart_schedule]
    with models.DAG(
            'composer_agg_quickstart',
            # Continue to run DAG once per day
            schedule_interval=datetime.timedelta(days=1),
            default_args=default_dag_args) as dag:
        # [END composer_quickstart_schedule]
        op_start = BashOperator(task_id='Initializing', bash_command='echo Initialized')
        #op_readwrite = PythonOperator(task_id = 'ReadAggWriteFile', python_callable=read_data)
        op_load = gcs_to_bq.GoogleCloudStorageToBigQueryOperator(
            task_id='gcs2bq',
            bucket='dr-mockup-data',
            source_objects=['sample.csv'],
            destination_project_dataset_table='data-rubrics.sample_bqtable',
            schema_fields=[{'name': 'a', 'type': 'STRING', 'mode': 'NULLABLE'}, {'name': 'b', 'type': 'FLOAT', 'mode': 'NULLABLE'}],
            write_disposition='WRITE_TRUNCATE',
            dag=dag)
        #op_write = PythonOperator(task_id = 'AggregateAndWriteFile', python_callable=write_data)
        op_start >> op_load
UPDATE:
Can you remove dag=dag from gcs2bq task as you are already using with models.DAG and run your dag again?
It might be because you have a dynamic start date. Your start_date should never be dynamic. Read this FAQ: https://airflow.apache.org/faq.html#what-s-the-deal-with-start-date
We recommend against using dynamic values as start_date, especially datetime.now() as it can be quite confusing. The task is triggered once the period closes, and in theory an @hourly DAG would never get to an hour after now as now() moves along.
Make your start_date static or use Airflow utils/macros:
import airflow
args = {
    'owner': 'airflow',
    'start_date': airflow.utils.dates.days_ago(2),
}
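To see why a moving start_date is a problem, recall from the FAQ quote that a run is triggered once its period closes, i.e. at start_date plus one schedule interval. The sketch below uses plain datetime arithmetic rather than the scheduler; first_trigger_time is a hypothetical helper, and the dates are arbitrary.

```python
from datetime import datetime, timedelta


def first_trigger_time(start_date, interval):
    """The run covering [start_date, start_date + interval) fires when that period closes."""
    return start_date + interval


interval = timedelta(hours=1)

# Static start_date: the first trigger time is fixed, so the clock eventually reaches it.
static_trigger = first_trigger_time(datetime(2019, 1, 1), interval)

# start_date recomputed as "now" on every parse: the trigger time keeps moving ahead.
checks = []
for now in (datetime(2019, 1, 1, 0, 30), datetime(2019, 1, 1, 1, 30), datetime(2019, 1, 1, 2, 30)):
    moving_trigger = first_trigger_time(now, interval)
    checks.append((now, now >= static_trigger, now >= moving_trigger))

if __name__ == "__main__":
    for now, static_due, moving_due in checks:
        print(now, "static due:", static_due, "moving due:", moving_due)
```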
Okay, this was a stupid question on my part, and apologies to everyone who wasted time here. I had a DAG running, because of which the one I was triggering was always in the queue. Also, I did not write the correct value in destination_project_dataset_table. Thanks and apologies to all who spent time.

Run only the latest Airflow DAG

Let's say I would like to run a pretty simple ETL DAG with Airflow:
it checks the last insert time in DB2, and it loads newer rows from DB1 to DB2, if any.
There are some understandable requirements:
It is scheduled hourly, but the first few runs will last more than 1 hour:
e.g. the first run should process a month of data, and it lasts for 72 hours,
so the second run should process the last 72 hours, and it lasts 7.2 hours,
the third processes 7.2 hours and finishes within an hour,
and from then on it runs hourly.
While the DAG is running, don't start the next one; skip it instead.
If the trigger time has passed and the DAG didn't start, don't start it afterwards.
There are other DAGs as well; the DAGs should be executed independently.
I've found these parameters and this operator a little confusing. What are the distinctions between them?
depends_on_past
catchup
backfill
LatestOnlyOperator
Which one should I use, and which LocalExecutor?
PS: There's already a very similar thread, but it isn't exhaustive.
DAG max_active_runs = 1 combined with catchup = False would solve this.
This one satisfies my requirements. The DAG runs every minute, and my "main" task lasts for 90 seconds, so it should skip every second run.
I've used a ShortCircuitOperator to check whether the current run is the only one at the moment (by querying the dag_run table of the Airflow DB), and catchup=False to disable backfilling.
However, I cannot properly utilize the LatestOnlyOperator, which should do something similar.
DAG file
import os
import sys
from datetime import datetime
import airflow
from airflow import DAG
from airflow.operators.python_operator import PythonOperator, ShortCircuitOperator
import foo
import util
default_args = {
    'owner': 'airflow',
    'depends_on_past': True,
    'start_date': datetime(2018, 2, 13),  # or any date in the past
    'email': ['services@mydomain.com'],
    'email_on_failure': True}
dag = DAG(
    'test90_dag',
    default_args=default_args,
    schedule_interval='* * * * *',
    catchup=False)
condition_task = ShortCircuitOperator(
    task_id='skip_check',
    python_callable=util.is_latest_active_dagrun,
    provide_context=True,
    dag=dag)
py_task = PythonOperator(
    task_id="test90_task",
    python_callable=foo.bar,
    provide_context=True,
    dag=dag)
airflow.utils.helpers.chain(condition_task, py_task)
util.py
import logging
from datetime import datetime
from airflow.hooks.postgres_hook import PostgresHook
def get_num_active_dagruns(dag_id, conn_id='airflow_db'):
    # for this you have to set this value in the airflow db
    airflow_db = PostgresHook(postgres_conn_id=conn_id)
    conn = airflow_db.get_conn()
    cursor = conn.cursor()
    sql = "select count(*) from public.dag_run where dag_id = '{dag_id}' and state in ('running', 'queued', 'up_for_retry')".format(dag_id=dag_id)
    cursor.execute(sql)
    num_active_dagruns = cursor.fetchone()[0]
    return num_active_dagruns
def is_latest_active_dagrun(**kwargs):
    num_active_dagruns = get_num_active_dagruns(dag_id=kwargs['dag'].dag_id)
    return (num_active_dagruns == 1)
foo.py
import datetime
import time
def bar(*args, **kwargs):
    t = datetime.datetime.now()
    execution_date = str(kwargs['execution_date'])
    with open("/home/airflow/test.log", "a") as myfile:
        myfile.write(execution_date + ' - ' + str(t) + '\n')
    time.sleep(90)
    with open("/home/airflow/test.log", "a") as myfile:
        myfile.write(execution_date + ' - ' + str(t) + ' +90\n')
    return 'bar: ok'
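The effect described in this answer (a run every minute, a 90-second task, every second run skipped) can be checked with a small simulation that does not need Airflow. It is a simplification: simulate_runs is a hypothetical function that models the skip check as "no earlier run is still active" instead of querying the dag_run table.

```python
def simulate_runs(num_runs, interval_s=60, task_duration_s=90):
    """Return a list of (start_second, 'ran' | 'skipped'), one entry per scheduled run."""
    results = []
    busy_until = 0  # second at which the currently running task finishes
    for i in range(num_runs):
        start = i * interval_s
        if start < busy_until:
            # An earlier run is still active, so the short-circuit check skips this one.
            results.append((start, 'skipped'))
        else:
            results.append((start, 'ran'))
            busy_until = start + task_duration_s
    return results


if __name__ == "__main__":
    for start, outcome in simulate_runs(6):
        print(start, outcome)
```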
Acknowledgement: this answer is based on this blog post.
Use DAG max_active_runs = 1 combined with catchup = False, and add a DUMMY task right at the beginning (a sort of START task) with wait_for_downstream=True.
As for LatestOnlyOperator: it will help to avoid rerunning a task if the previous execution is not yet finished.
Or create the "START" task as a LatestOnlyOperator and make sure all tasks in the first processing layer are connected to it. But pay attention: as per the docs, "Note that downstream tasks are never skipped if the given DAG_Run is marked as externally triggered."
