extract parameters from BashOperator in Airflow

I have the following operator:
import = BashOperator(
    task_id='import',
    bash_command="""python3 script.py `{{ var.value.run_value }}` 'file.json'""",
    dag=dag)
When I look at the Rendered Template I see:
python3 script.py `2018-09-13 11:53:38.725089` 'file.json'
So far so good.
However, my script doesn't seem to work with this input:
if __name__ == '__main__':
    if str(sys.argv[1]):
        time_value = str(sys.argv[1])[:-7]  # from 2018-09-13 11:01:18.287705 to 2018-09-13 11:01:18
    else:
        time_value = '1900-01-01 00:00:00'

    requestedDate = time_value.split(' ', 1)[0]  # From 2018-08-20 15:00:00 get only 2018-08-20
    requestedTime = (time_value.split(' ', 1)[1])[:-3]  # From 2018-08-20 15:00:00 get only 15:00
    pathConfigFile = (sys.argv[2])
This doesn't work.
What I want is:
time_value = YYYY-MM-DD HH:MM:SS
requestedDate = YYYY-MM-DD
requestedTime = HH:MM:SS
pathConfigFile = 2nd parameter given.
Airflow shows me:
{bash_operator.py:101} INFO - SyntaxError: invalid syntax
Also, I can't even print the input.
When I try the code without Airflow as pure python script I don't have any issues.
I should note that Airflow is running under Python 2.7, but the script executes under Python 3.
What is the problem?

You seem to have quoted the macro in back-ticks, which Bash interprets as command substitution and tries to execute the contents. You should switch to single quotes.
Your rendered output should look like:
python3 script.py '2018-09-13 11:53:38.725089' 'file.json'
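For example, a minimal sketch of the fixed operator (the variable is renamed to import_task here only because import is a reserved word in Python):

import_task = BashOperator(
    task_id='import',
    bash_command="""python3 script.py '{{ var.value.run_value }}' 'file.json'""",
    dag=dag)

With the single quotes, sys.argv[1] arrives as the full timestamp string and sys.argv[2] as file.json, so the [:-7] and split(' ', 1) logic should behave exactly as it does when the script is run standalone.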

Related

Airflow XCOMs communication from BashOperator to PythonOperator

I'm new to Apache Airflow and trying to write my first DAG, which has a task that depends on another task (using ti.xcom_pull).
PS: I run Airflow in WSL Ubuntu 20.04 using VS Code.
I created task 1 (task_id='get_datetime') that runs the date bash command (and it works),
then I created another task (task_id='process_datetime') which takes the datetime from the first task and processes it; I set the python_callable and everything is fine.
The issue is that dt = ti.xcom_pull returns None when I run "airflow tasks test first_ariflow_dag process_datetime 2022-11-1" in the terminal, but when I check the log in the Airflow UI, it works normally.
Could someone give me a solution, please?
from datetime import datetime

from airflow.models import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator


def process_datetime(ti):
    dt = ti.xcom_pull(task_ids=['get_datetime'])
    if not dt:
        raise Exception('No datetime value')
    dt = str(dt[0]).split()
    return {
        'year': int(dt[-1]),
        'month': dt[1],
        'day': int(dt[2]),
        'time': dt[3],
        'day_of_week': dt[0]
    }


with DAG(
    dag_id='first_ariflow_dag',
    schedule_interval='* * * * *',
    start_date=datetime(year=2022, month=11, day=1),
    catchup=False
) as dag:

    # 1. Get the current datetime
    task_get_datetime = BashOperator(
        task_id='get_datetime',
        bash_command='date'
    )

    # 2. Process the datetime
    task_process_datetime = PythonOperator(
        task_id='process_datetime',
        python_callable=process_datetime
    )
I get this error :
[2022-11-02 00:51:45,420] {taskinstance.py:1851} ERROR - Task failed with exception
Traceback (most recent call last):
File "/mnt/c/Users/Salim/Desktop/A-Learning/Airflow_Conda/airflow_env/lib/python3.8/site-packages/airflow/operators/python.py", line 175, in execute
return_value = self.execute_callable()
File "/mnt/c/Users/Salim/Desktop/A-Learning/Airflow_Conda/airflow_env/lib/python3.8/site-packages/airflow/operators/python.py", line 193, in execute_callable
return self.python_callable(*self.op_args, **self.op_kwargs)
File "/home/salim/airflow/dags/first_dag.py", line 12, in process_datetime
raise Exception('No datetime value')
Exception: No datetime value
According to the documentation, to push data to XCom you need to set the parameter do_xcom_push (Airflow 2) or xcom_push (Airflow 1) on the operator:
If BaseOperator.do_xcom_push is True, the last line written to stdout
will also be pushed to an XCom when the bash command completes
The BashOperator should look like this:
task_get_datetime = BashOperator(
    task_id='get_datetime',
    bash_command='date',
    do_xcom_push=True
)
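A small follow-up sketch: xcom_pull(task_ids=['get_datetime']) with a list returns a list of values (which is why the code above indexes dt[0]), while passing a single task-id string returns the pushed value directly, so the callable could also be written as:

def process_datetime(ti):
    # A single task_id string returns the pushed value itself
    # (the last line of the BashOperator's stdout), not a one-element list.
    dt = ti.xcom_pull(task_ids='get_datetime')
    if not dt:
        raise Exception('No datetime value')
    parts = str(dt).split()
    return {
        'year': int(parts[-1]),
        'month': parts[1],
        'day': int(parts[2]),
        'time': parts[3],
        'day_of_week': parts[0],
    }

Also note that when testing process_datetime in isolation, get_datetime may simply not have pushed an XCom yet for that execution date, which would explain the None in the terminal while the scheduled run visible in the UI works.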

Trigger DAG Run on environment startup/restart

Is there a way to trigger a DAG every time the airflow environment is brought up? This would be helpful to run some tests on the environment
You can monitor the boot time of the system or process, check the last execution datetime of the DAG, and trigger the DAG accordingly.
You can refer to the following code:
from datetime import datetime

import psutil
from dateutil import parser
from airflow.models import Variable

last_reboot = psutil.boot_time()
boot_time = datetime.fromtimestamp(last_reboot)
Variable.set('boot_time', boot_time.strftime("%Y-%m-%d %H:%M:%S"))

last_run_dt = Variable.get('last_run_end_date', "")
if last_run_dt == "":
    last_run_at = parser.parse('1970-01-01 00:00:00')
else:
    last_run_at = parser.parse(last_run_dt)

if boot_time > last_run_at:
    # Your code to trigger the Dag
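One way to fill in that trigger step, as a sketch (my_dag_id is a placeholder for the DAG you want to run, and this assumes the airflow CLI is available on PATH), is to shell out to airflow dags trigger:

import subprocess

if boot_time > last_run_at:
    # Trigger the placeholder DAG through the Airflow CLI after a reboot.
    subprocess.run(["airflow", "dags", "trigger", "my_dag_id"], check=True)

The same effect can be achieved with TriggerDagRunOperator or the stable REST API if you prefer to stay inside Airflow.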

airflow 2.2 timetable for schedule, always with error: timetable not registered

I followed this example:
I created the example timetable .py file and put it in $HOME/airflow/plugins,
and created the example DAG file and put it in $HOME/airflow/dags.
After restarting the scheduler and webserver, I get a DAG import error. In the web UI, the last line of the detailed error message is:
airflow.exceptions.SerializationError: Failed to serialize DAG 'example_timetable_dag2': Timetable class 'AfterWorkdayTimetable' is not registered
But if I run airflow plugins, I can see the timetable in the name and source list.
How can I fix this error?
Details of plugins/AfterWorkdayTimetable.py:
from datetime import timedelta
from typing import Optional

from pendulum import Date, DateTime, Time, timezone

from airflow.plugins_manager import AirflowPlugin
from airflow.timetables.base import DagRunInfo, DataInterval, TimeRestriction, Timetable

UTC = timezone("UTC")


class AfterWorkdayTimetable(Timetable):
    def infer_data_interval(self, run_after: DateTime) -> DataInterval:
        weekday = run_after.weekday()
        if weekday in (0, 6):  # Monday and Sunday -- interval is last Friday.
            days_since_friday = (run_after.weekday() - 4) % 7
            delta = timedelta(days=days_since_friday)
        else:  # Otherwise the interval is yesterday.
            delta = timedelta(days=1)
        start = DateTime.combine((run_after - delta).date(), Time.min).replace(tzinfo=UTC)
        return DataInterval(start=start, end=(start + timedelta(days=1)))

    def next_dagrun_info(
        self,
        *,
        last_automated_data_interval: Optional[DataInterval],
        restriction: TimeRestriction,
    ) -> Optional[DagRunInfo]:
        if last_automated_data_interval is not None:  # There was a previous run on the regular schedule.
            last_start = last_automated_data_interval.start
            last_start_weekday = last_start.weekday()
            if 0 <= last_start_weekday < 4:  # Last run on Monday through Thursday -- next is tomorrow.
                delta = timedelta(days=1)
            else:  # Last run on Friday -- skip to next Monday.
                delta = timedelta(days=(7 - last_start_weekday))
            next_start = DateTime.combine((last_start + delta).date(), Time.min).replace(tzinfo=UTC)
        else:  # This is the first ever run on the regular schedule.
            next_start = restriction.earliest
            if next_start is None:  # No start_date. Don't schedule.
                return None
            if not restriction.catchup:
                # If the DAG has catchup=False, today is the earliest to consider.
                next_start = max(next_start, DateTime.combine(Date.today(), Time.min).replace(tzinfo=UTC))
            elif next_start.time() != Time.min:
                # If earliest does not fall on midnight, skip to the next day.
                next_day = next_start.date() + timedelta(days=1)
                next_start = DateTime.combine(next_day, Time.min).replace(tzinfo=UTC)
            next_start_weekday = next_start.weekday()
            if next_start_weekday in (5, 6):  # If next start is in the weekend, go to next Monday.
                delta = timedelta(days=(7 - next_start_weekday))
                next_start = next_start + delta
        if restriction.latest is not None and next_start > restriction.latest:
            return None  # Over the DAG's scheduled end; don't schedule.
        return DagRunInfo.interval(start=next_start, end=(next_start + timedelta(days=1)))


class WorkdayTimetablePlugin(AirflowPlugin):
    name = "workday_timetable_plugin"
    timetables = [AfterWorkdayTimetable]
Details of dags/test_afterwork_timetable.py:
import datetime
from airflow import DAG
from AfterWorkdayTimetable import AfterWorkdayTimetable
from airflow.operators.dummy import DummyOperator

with DAG(
    dag_id="example_workday_timetable",
    start_date=datetime.datetime(2021, 1, 1),
    timetable=AfterWorkdayTimetable(),
    tags=["example", "timetable"],
) as dag:
    DummyOperator(task_id="run_this")
If I run airflow plugins:
name | source
==================================+==========================================
workday_timetable_plugin | $PLUGINS_FOLDER/AfterWorkdayTimetable.py
I had a similar issue.
Either you need to add an __init__.py file to the plugins folder, or you can try the following to debug your issue:
Get all plugin manager objects:
from airflow import plugins_manager
plugins_manager.initialize_timetables_plugins()
plugins_manager.timetable_classes
I got this result: {'quarterly.QuarterlyTimetable': <class 'quarterly.QuarterlyTimetable'>}
Compare your result with the exception message. If the timetable_classes dictionary contains a different name than the one in the error, you need to change the plugin file path or the import so that they match.
You could also try this inside the DAG python file:
from AfterWorkdayTimetable import AfterWorkdayTimetable
from airflow import plugins_manager
print(plugins_manager.as_importable_string(AfterWorkdayTimetable))
This would help you find the name that Airflow tries to use when searching through the timetable_classes dictionary.
You need to register the timetable in the "timetables" array via the plugin interface. See:
https://airflow.apache.org/docs/apache-airflow/stable/plugins.html
I encountered the same issue.
These are the steps I followed:
Add the timetable file (custom_tt.py) to the plugins folder.
Make sure the plugins folder has an __init__.py file present.
Change lazy_load_plugins in airflow.cfg to False:
lazy_load_plugins = False
Add the import statement in the DAG file:
from custom_tt import CustomTimeTable
In the DAG:
DAG(timetable=CustomTimeTable())
Restart the webserver and scheduler.
Problem fixed.
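For reference, a minimal sketch of what that plugins/custom_tt.py might contain (only the registration boilerplate is shown; the timetable methods themselves are omitted):

from airflow.plugins_manager import AirflowPlugin
from airflow.timetables.base import Timetable


class CustomTimeTable(Timetable):
    # infer_manual_data_interval() and next_dagrun_info() go here,
    # as in the AfterWorkdayTimetable example above.
    ...


class CustomTimetablePlugin(AirflowPlugin):
    name = "custom_timetable_plugin"
    timetables = [CustomTimeTable]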
They have found the resolution to this, but it doesn't seem like they have updated the documentation to reflect the fix just yet.
Your function
def infer_data_interval(self, run_after: DateTime) -> DataInterval:
should be
def infer_manual_data_interval(self, *, run_after: DateTime) -> DataInterval:
See reference:
Apache airflow.timetables.base Documentation
After updating the function with the correct name and extra parameter, everything else should work for you as it did for me.
I was running into this as well. airflow plugins reports that the plugin is registered, and running the DAG script on the command line works fine, but the web UI reports that the plugin is not registered. #Bakuchi's answer pointed me in the right direction.
In my case, the problem was how I was importing the Timetable - airflow apparently expects you to import it relative to the $PLUGINS_FOLDER, not from any other directory, even if that other directory is also on the PYTHONPATH.
For a concrete example:
export PYTHONPATH=/path/to/my/code:$PYTHONPATH
# airflow.cfg
plugins_folder = /path/to/my/code/airflow_plugins
# dag.py
import sys
from airflow_plugins.custom_timetable import CustomTimetable as Bad
from custom_timetable import CustomTimetable as Good
from airflow import plugins_manager
plugins_manager.initialize_timetables_plugins()
print(sys.path) # /path/to/my/code:...:/path/to/my/code/airflow_plugins
print(plugins_manager.as_importable_string(Bad)) # airflow_plugins.custom_timetable.CustomTimetable
print(plugins_manager.as_importable_string(Good)) # custom_timetable.CustomTimetable
print(plugins_manager.timetable_classes) # {'custom_timetable.CustomTimetable': <class 'custom_timetable.CustomTimetable'>}
A bad lookup in plugins_manager.timetable_classes is ultimately what ends up raising the _TimetableNotRegistered error, so the fix is to make the keys match by changing how the timetable is imported.
I submitted a bug report: https://github.com/apache/airflow/issues/21259

Using xcom_push and xcom_pull in python file that called from BashOperator

I saw some similar questions about this (like this and this) but none of them answer this question.
I want to run some python file with BashOperator.
Like this:
my_task = BashOperator(
    task_id='my_task',
    bash_command='python3 /opt/airflow/dags/programs/my_task.py',
)
Is there a way I can call xcom_push and xcom_pull from my_task.py?
You can either switch the task to a PythonOperator or pass arguments to the script through the bash command using Jinja syntax.
PythonOperator
from programs.my_task import my_function

my_task = PythonOperator(
    task_id='my_task',
    python_callable=my_function,
)
my_task.py
def my_function(**context):
    xcom_value = context['ti'].xcom_pull(task_ids='previous_task')
    context['ti'].xcom_push(key='key', value='value')  # this one is pushed to xcom
    return "xcom_push_value"  # this value is also stored to xcom (xcom_push).
Pass arguments to the python script
my_task = BashOperator(
    task_id='my_task',
    bash_command='python3 /opt/airflow/dags/programs/my_task.py {{ ti.xcom_pull(task_ids="previous_task") }}',
)
my_task.py
import sys

if __name__ == '__main__':
    xcom_pulled_value = sys.argv[1]
    print("xcom_push_value")  # last line of stdout is stored to xcom.
Alternatively, with this approach, you can use argparse.
If you need to use XComs in a BashOperator and the desire is to pass the arguments to a python script from the XComs, then I would suggest adding some argparse arguments to the python script, then using named arguments and Jinja templating in the bash_command. So something like this:
# Assuming you already xcom pushed the variable as "my_xcom_variable"
my_task = BashOperator(
    task_id='my_task',
    bash_command='python3 /opt/airflow/dags/programs/my_task.py --arg1={{ ti.xcom_pull(key="my_xcom_variable") }}',
)
Then if you are unfamiliar with argparse you can add it at the end of the python script like so:
# ^^^ The rest of your program is up here ^^^
# I have no idea what your python script is,
# just assuming your main program is a function called main_program()
# add as many arguments as you want and name them whatever you want

if __name__ == "__main__":
    import argparse

    parser = argparse.ArgumentParser()
    parser.add_argument('--arg1')
    args = parser.parse_args()
    main_program(args_from_the_outside=args.arg1)

Airflow task setup with execution date

I want to customize the task to be weekday-dependent in the DAG file. It seems that Airflow macros like {{ next_execution_date }} are not directly available in the python DAG file. This is my DAG definition:
RG_TASKS = {
    'asia': {
        'start_date': pendulum.datetime(2021, 1, 1, 16, 0, tz='Asia/Tokyo'),
        'tz': 'Asia/Tokyo',
        'files': [
            '/path/%Y%m%d/asia_file1.%Y%m%d.csv',
            '/path/%Y%m%d/asia_file2.%Y%m%d.csv',
            ...],
    },
    'euro': {
        'start_date': pendulum.datetime(2021, 1, 1, 16, 0, tz='Europe/London'),
        'tz': 'Europe/London',
        'files': [
            '/path/%Y%m%d/euro_file1.%Y%m%d.csv',
            '/path/%Y%m%d/euro_file2.%Y%m%d.csv',
            ...],
    },
}

dag = DAG(..., start_date=pendulum.datetime(2021, 1, 1, 16, 0, tz='Asia/Tokyo'),
          schedule='00 16 * * 0-6')
for rg, t in RG_TASKS.items():
    tz = t['tz']
    h = t['start_date'].hour
    m = t['start_date'].minute
    target_time = f'{{{{ next_execution_date.replace(tzinfo="{tz}", hour={h}, minute={m}) }}}}'
    time_sensor = DateTimeSensor(dag=dag, task_id=f'wait_for_{rg}', target_time=target_time)

    bash_task = BashOperator(dag=dag, task_id=f'load_{rg}', trigger_rule='all_success',
                             depends_on_past=True, bash_command=...)

    for fname in t['files']:
        fpath = f'{{{{ next_execution_date.strftime("{fname}") }}}}'
        task_id = os.path.basename(fname).split('.')[0]
        file_sensor = FileSensor(dag=dag, task_id=task_id, filepath=fpath, ...)
        file_sensor.set_upstream(time_sensor)
        file_sensor.set_downstream(bash_task)
The above works: the bash_task is triggered once all files are available, and it has depends_on_past=True. However, the files have slightly different schedules: {rg}_file1 is available 6 days/week (every day except Saturday), while the rest are available 7 days a week.
One option is to create 2 DAGs, one scheduled to run Sun-Fri and the other scheduled to run Sat only. But with this option, depends_on_past=True is broken on Saturday.
Is there any better way to keep depends_on_past=True 7 days/week? Ideally, in the files loop, I could do something like:
for fname in t['files']:
    dt = ...
    if dt.weekday() == 5 and task_id == f'{rg}_file1':
        continue
Generally I think it's better to accomplish things in a single task when it is easy enough to do, and in this case it seems to me you can.
I'm not entirely sure why you are using a datetime sensor, but it does not seem necessary. As far as I can tell, you just want your process to run every day (ideally after the file is there) and skip once per week.
I think we can do away with file sensor too.
Option 1: everything in bash
Check for the file's existence in your bash script and fail (with retries) if it is missing; just return a non-zero exit code when the file is missing.
Then in your bash script you could silently do nothing on the skip day.
On skip days, your bash task will be green even though it did nothing.
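A rough sketch of that option for one of the files (the path, the load command, and the retry count are placeholders; the skip day is hard-coded to Saturday, which next_execution_date.weekday() renders as 5):

load_asia_file1 = BashOperator(
    dag=dag,
    task_id='load_asia_file1',
    depends_on_past=True,
    retries=10,  # "fail with retries" while the file has not landed yet
    bash_command="""
    set -e
    # Skip day (Saturday): this file never arrives, so do nothing and exit green.
    if [ "{{ next_execution_date.weekday() }}" = "5" ]; then
        echo "skip day - nothing to load"
        exit 0
    fi
    f='/path/{{ next_ds_nodash }}/asia_file1.{{ next_ds_nodash }}.csv'
    # Fail (non-zero exit) while the file is missing so Airflow retries later.
    [ -f "$f" ] || { echo "missing $f"; exit 1; }
    python3 load.py "$f"
    """,
)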
Option 2: subclass bash operator
Subclass BashOperator and add a skip_day parameter. Then your execute is like this:
def execute(self, context):
    next_execution_date = context['next_execution_date']
    if next_execution_date.day_of_week == self.skip_day:
        raise AirflowSkipException(f'we skip on day {self.skip_day}')
    super().execute(context)
With this option your bash script still needs to fail if file missing, but doesn't need to deal with the skip logic. And you'll be able to see that the task skipped in the UI.
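Spelled out a bit more, the subclass could look something like this (class and parameter names are just suggestions; skip_day would be a pendulum day-of-week value such as pendulum.SATURDAY):

from airflow.exceptions import AirflowSkipException
from airflow.operators.bash import BashOperator


class SkipDayBashOperator(BashOperator):
    def __init__(self, *, skip_day, **kwargs):
        super().__init__(**kwargs)
        self.skip_day = skip_day  # pendulum day-of-week value, e.g. pendulum.SATURDAY

    def execute(self, context):
        next_execution_date = context['next_execution_date']
        if next_execution_date.day_of_week == self.skip_day:
            raise AirflowSkipException(f'we skip on day {self.skip_day}')
        return super().execute(context)

Usage is the same as the existing bash_task, with skip_day added for the tasks that should sit out one day a week.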
Either way, no sensors.
Other note
You can simplify your filename templating.
files = [
    '/path/{{ next_ds_nodash }}/euro_file2.{{ next_ds_nodash }}.csv',
    ...
]
Then you don't need to mess with strftime.
